ENERGY-EFFICIENT
COARSE-GRAIN OUT-OF-ORDER EXECUTION
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Milad Mohammadi
August 2015
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/bp863pb8596
© 2015 by Milad Mohammadi. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Bill Dally, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Alex Aiken
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Christos Kozyrakis
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Tor Aamodt
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Preface
Throughout the past decade, energy-efficient computing has been a major problem
across the computer system space, from device technologies to chip design to computer
architecture and software systems. Today, more smartphone devices are in the
hands of consumers than ever before, and more data is generated on the internet than
ever before. This trend implies that, moving forward, building energy-efficient systems
will remain an increasingly significant challenge in both the mobile industry and the
server industry. Building energy-efficient processors is at the forefront of solving these
technological challenges.
This doctoral dissertation describes the Coarse-Grain Out-of-Order (CG-OoO)
processor architecture. Block-level code processing is at the heart of the CG-OoO
architecture; CG-OoO speculates, fetches, schedules, and commits code at block-level
granularity. CG-OoO eliminates unnecessary accesses to energy-consuming tables,
and replaces large tables with smaller, distributed tables that are cheaper to access.
CG-OoO leverages compiler-level code optimizations to deliver more efficient static
code. It exploits both instruction-level and block-level parallelism. CG-OoO
introduces the Skipahead model, a complexity-effective, limited out-of-order
instruction issue model. It is an energy-performance proportional design that
scales according to the program load. Through the energy efficiency techniques applied
to the compiler and processor pipeline stages, CG-OoO delivers over 50% energy
reduction at the performance of the baseline out-of-order processor.
Acknowledgements
The work presented in this doctoral dissertation could not have been possible without
the support, care, and guidance of many individuals. I would like to extend my sincere
appreciation to the following people.
I would like to express my gratitude to my wonderful advisor and teacher, Professor
William J. Dally, whose vision and intuition have guided me throughout my
graduate studies. I greatly benefited from Professor Dally’s endless optimism toward
solving challenging problems and his boundless enthusiasm for innovation.
I would like to sincerely thank my second advisor, Professor Tor M. Aamodt, whose
wealth of knowledge and depth of intuition provided me the tools I needed to advance
my research. Special thanks to Professor Christos Kozyrakis for his encouragement
as I entered the field of computer architecture, and for agreeing to serve on my
dissertation reading committee. I had the pleasure of having him as
my lecturer for three computer architecture courses that provided me the scientific
foundation to pursue my PhD research in this field. I also thank Professor Alex
Aiken for agreeing to serve on my dissertation reading committee. I had the distinct
pleasure of being his student in the Stanford Parallel Computing class. Special thanks
to Professor Mark Horowitz for providing valuable feedback on my thesis during the
final year of my PhD studies.
I would like to acknowledge and thank my fantastic friends and colleagues in
the CVA lab: Curt Harting, Ted Jiang, Daniel Becker, George Michelogiannakis,
James Chen, Subhasis Das, Nic McDonald, Song Han, Albert Ng, Vishal Parikh,
Camilo Moreno, Yatish Turakhia. I would also like to thank the wonderful CVA
administrators, Sue George and Uma Mulukutla, who have always been kind and
helpful to me. Special thanks to Curt Harting and James Chen for their mentorship
during the first half of my tenure at the CVA lab. Also, special thanks to my friend,
Subhasis Das, for being a passionate and smart labmate (with a great sense of humor),
especially during our collaboration on building the energy model for this thesis.
I would like to thank my friends Behnam Montazeri, Christina Delimitrou, Camilo
Moreno, Ardavan Pedram, and Nicole Celeste Rodia, who engaged with my research
and provided valuable feedback. I would also like to thank my wonderful friends
in the Stanford Persian community, whose presence made Stanford feel like home. I
especially thank my friends with whom I ran the Stanford Persian Student Association
(PSA) board: Ehsan Sadeghipour, Reza Mirghaderi, Dorna Kashef, Alireza Sharafat,
Parnian Zargham, Maryam Daneshi, Pooya Ehsani, Shahab Mirjalili, Masoud Tavazoei,
and Alborz Bejnood.
I would like to sincerely thank my extended family, Soraya, Hosein, Pedram, and
Payam Lajevardi. Soraya and Hossein have been nothing short of my second parents
during the years I lived away from home. I thank Pedram and Payam for the
brotherly advice and encouragement they gave me throughout my post-secondary
education.
I thank my sisters, Mojdeh and Yasamin Mohammadi, and my brother-in-law,
Farhad Fereidooni, for their love and support throughout the years. I also thank
them for giving my parents the care and attention they deserved during my 11-year
absence from home.
I thank my wonderful parents-in-law, Morteza and Mehri Mohammadgiahi, whose
love and emotional support have always brought hope and strength to my family.
I would like to sincerely thank my exceptional parents, Amirhossein and Soheila
Mohammadi, who enabled and supported me in pursuing my dream of becoming a
scientist. Their unconditional love for me and their enormous sacrifices gave me
the courage to pursue my dreams. I find myself in eternal debt to them.
I especially thank my extraordinary wife, my best friend, Marjan, whose continual
selfless support, at my busiest and toughest moments, helped me focus on research,
and whose unshakable confidence in me, even at times when I doubted my ability to
deliver, gave me courage to march forward.
Dedicated to my wife and my best friend, Marjan.
Contents
Preface iv
Acknowledgements v
1 Introduction 1
1.1 Collaborations and Other Contributions . . . . . . . . . . . . . . . . 2
1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 OoO Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Coarse-Grain Out-of-Order Execution Model 12
3.1 Coarse-Grain Out-of-Order Execution . . . . . . . . . . . . . . . . . . 12
3.2 Constructing Blocks for CG-OoO . . . . . . . . . . . . . . . . . . . . 15
3.2.1 Block Boundary Annotation . . . . . . . . . . . . . . . . . . . 15
3.2.2 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Program Execution Flow in CG-OoO . . . . . . . . . . . . . . 18
3.3.2 Control Speculation in CG-OoO . . . . . . . . . . . . . . . . . 21
3.4 Squash Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.1 Squash due to Control Mis-speculation . . . . . . . . . . . . . 22
3.4.2 Squash due to Memory Mis-speculation . . . . . . . . . . . . . 23
3.5 Static Instruction Scheduling . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Sources of Parallelism in CG-OoO . . . . . . . . . . . . . . . . . . 24
3.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 System Architecture 27
4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.1 Branch Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.2 Fetch Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.3 Decode Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.4 Register Rename / Block Allocation Stage . . . . . . . . . . . 37
4.2.5 Instruction Steer Stage . . . . . . . . . . . . . . . . . . . . . . 42
4.2.6 Front-end Examples . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.7 Issue Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.8 Memory Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.9 Write-Back & Commit Stage . . . . . . . . . . . . . . . . . . . 57
4.3 Squash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Methodology 62
5.1 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Functional Emulator . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Timing Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4 Energy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6 Coarse-Grain Out-of-Order Evaluation 85
6.1 Sources of Energy Cost . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 CG-OoO Design Characterization . . . . . . . . . . . . . . . . . . . . 86
6.3 CG-OoO Performance Analysis . . . . . . . . . . . . . . . . . . . . . 88
6.4 CG-OoO Energy Analysis . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.1 Block Level Branch Prediction . . . . . . . . . . . . . . . . . . 93
6.4.2 Register File Hierarchy . . . . . . . . . . . . . . . . . . . . . . 93
6.4.3 Instruction Scheduling . . . . . . . . . . . . . . . . . . . . . . 98
6.4.4 Block Re-Order Buffer . . . . . . . . . . . . . . . . . . . . 99
6.5 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7 Related Work 104
7.1 CG-OoO Design Features . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2 CG-OoO Energy Efficiency Features . . . . . . . . . . . . . . . . . 109
7.2.1 Degree of Coarse Granularity . . . . . . . . . . . . . . . . 110
7.2.2 Front-end Energy Efficiency . . . . . . . . . . . . . . . . . 110
7.2.3 Back-end Energy Efficiency . . . . . . . . . . . . . . . . . 112
7.3 OoO Energy Efficiency Arguments . . . . . . . . . . . . . . . . . 113
7.4 Energy Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.5 Simulation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8 Conclusion 115
8.1 Summary of Thesis Contributions . . . . . . . . . . . . . . . . . . . . 115
8.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 117
Bibliography 119
List of Tables
5.1 System Parameters Shared Between All Core Architectures . . . . . . 71
5.2 System Parameters for Each Individual Core . . . . . . . . . . . . . . 72
7.1 Related Work: High Level Design Features Comparison . . . . . . . . 106
7.2 Related Work: Micro-architectural Features Comparison . . . . . . . 111
List of Figures
2.1 OoO, InO Execution Model Example . . . . . . . . . . . . . . . . . . 7
2.2 Energy and Performance Overhead of OoO vs. InO . . . . . . . . . . 8
2.3 OoO Architecture Pipeline Model . . . . . . . . . . . . . . . . . . . . 9
3.1 Block-Level Dynamic Execution Model . . . . . . . . . . . . . . . . . 14
3.2 The head instruction format. . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 do-while Loop Example . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 CG-OoO Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 CG-OoO Instruction Flow Example . . . . . . . . . . . . . . . . . . . 20
3.6 Instruction Flow Through the CG-OoO Front-end Example . . . . . . 22
3.7 CG-OoO Instruction Flow Squash Example . . . . . . . . . . . . . . 23
4.1 CG-OoO Detailed Micro-architecture . . . . . . . . . . . . . . . . . . 28
4.2 Branch Prediction Unit (BPU) micro-architecture . . . . . . . . . . . 30
4.3 Instruction Cache Fetch Example . . . . . . . . . . . . . . . . . . . . 32
4.4 An Example Code Fetch Sequence for CG-OoO . . . . . . . . . . . . 34
4.5 Logic Unit to Detect head Operations . . . . . . . . . . . . . . . . . . 36
4.6 Register Rename Bypass Logic . . . . . . . . . . . . . . . . . . . . . . 39
4.7 Block Allocation State Transition Diagram . . . . . . . . . . . . . . . 40
4.8 Block Allocation Routing Diagram . . . . . . . . . . . . . . . . . . . 41
4.9 CG-OoO Front-end . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.10 CG-OoO Fetch and Decode Examples . . . . . . . . . . . . . . . . . . 44
4.11 Block Window Components . . . . . . . . . . . . . . . . . . . . . . . 46
4.12 Instruction Queue & Head Buffer Entry Formats . . . . . . . . . . 46
4.13 Instruction Allocation & Issue Pipeline Stages . . . . . . . . . . . 48
4.14 Instruction Queue and Head Buffer Micro-architecture . . . . . . 49
4.15 Data-Forwarding and Wakeup Models . . . . . . . . . . . . . . . . 50
4.16 GRF segment access demultiplexer . . . . . . . . . . . . . . . . . 52
4.17 Interconnection Network Connecting EU Clusters . . . . . . . . . 53
4.18 Data Dependency Code Example . . . . . . . . . . . . . . . . . . 54
4.19 Head Buffer Micro-architecture . . . . . . . . . . . . . . . . . . . 56
4.20 Block Re-Order Buffer (BROB) Entry Format . . . . . . . . . . . 58
5.1 Simulation Software Infrastructure . . . . . . . . . . . . . . . . . . . 64
5.2 Compiler Software Pipeline . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Functional Emulator Software Infrastructure . . . . . . . . . . . . . . 66
5.4 Wrong-Path Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 Average Number of Instructions on the Wrong-Path . . . . . . . . . . 69
5.6 InO, OoO, CG-OoO Processor Pipelines . . . . . . . . . . . . . . . . 71
5.7 Squash State Transition Diagram . . . . . . . . . . . . . . . . . . . . 74
5.8 Squash Model Example . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.9 Energy Model Software Infrastructure . . . . . . . . . . . . . . . . . . 76
5.10 SPICE Energy Measurement Signal . . . . . . . . . . . . . . . . . . . 78
5.11 Energy Model Configuration Example . . . . . . . . . . . . . . . . . . 79
5.12 SRAM Table Energy & Area - Size Sweep . . . . . . . . . . . . . . . 80
5.13 SRAM Table Energy & Area - Port Sweep . . . . . . . . . . . . . . . 80
5.14 RAM & CAM Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.15 Flip-Flop in SPICE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.1 OoO Energy Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2 OoO Energy Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3 Dynamic Code Block Sizes . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4 CG-OoO Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.5 4-Wide CG-OoO Performance . . . . . . . . . . . . . . . . . . . . . . 90
6.6 List Scheduling Effect on Performance . . . . . . . . . . . . . . . 90
6.7 Skipahead Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.8 Processor Widths Effect on Performance . . . . . . . . . . . . . . 92
6.9 Normalized Processors Energy . . . . . . . . . . . . . . . . . . . . . . 94
6.10 Energy-Delay Product Inverse . . . . . . . . . . . . . . . . . . . . . . 95
6.11 Static & Dynamic Energy Breakdown . . . . . . . . . . . . . . . . . . 95
6.12 CG-OoO BPU Energy . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.13 Normalized Register File Energy . . . . . . . . . . . . . . . . . . . . . 97
6.14 Register Renaming Energy . . . . . . . . . . . . . . . . . . . . . . . . 98
6.15 Segmented Register File Energy Trend . . . . . . . . . . . . . . . . . 99
6.16 Dynamic Scheduler Energy . . . . . . . . . . . . . . . . . . . . . . . . 100
6.17 Commit Stage Energy . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.18 Harmonic Mean Speedup and Normalized Energy . . . . . . . . . . . 102
6.19 Normalized Power vs. Performance . . . . . . . . . . . . . . . . . . . 103
Chapter 1
Introduction
With recent technology innovations in different fields of computer science and
computer technology, including genomics, social media, and online entertainment, the
requirements for building energy-efficient, high-performance computer processors
have been increasing. In the consumer mobile space as well, the demand for
more energy-efficient processors that help extend battery life has been a major focus
of the computer architecture community and the processor manufacturing industry.
This research project has taken a bottom-up approach to identifying the energy
inefficiencies of existing single-threaded architectures, namely the out-of-order (OoO)
processor. The result of this study is a processor design that addresses these
inefficiencies while maintaining nearly the same level of performance as existing
processors in the industry. My research led me to find that the energy inefficiency of the
OoO processor is rooted in its overall execution model design; in other words, no
one component of the core hardware dominates the energy consumption.
This thesis addresses the energy problem by devising an alternative execution
model called the Coarse-Grain Out-of-Order (CG-OoO) model. The name, Coarse-
Grain, refers to the processor’s ability to process groups of instructions as a whole,
instead of processing instructions individually as is done in the out-of-order model.
As will be discussed in later chapters, the key to building such an energy-efficient,
high-performance, single-threaded processor is finding a design point where the
processor is nearly as simple as an in-order processor, yet still delivers high
instruction-level parallelism. Handling instructions in groups proves to be the
essential architectural requirement for the CG-OoO processor to deliver its superior
energy efficiency.
Multiple publications have shown that as the performance capability of processors
increases, their energy per operation increases non-linearly [2, 55]. In other words, the
energy cost of building powerful processors exceeds the obtained performance
benefit. In this work, I focus on designing the CG-OoO execution model such that
it consumes much less energy for the same performance capability as
the OoO processor while benefiting from a linearly proportional energy-performance
scaling trade-off.
1.1 Collaborations and Other Contributions
During my research studies at the Stanford Concurrent VLSI Architectures lab (CVA),
I worked on two research projects: the CG-OoO project, the topic of this
thesis, done in collaboration with Tor M. Aamodt and William J. Dally; and the
On-Demand Dynamic Branch Prediction (ODBP) project [36], done in
collaboration with Song Han, Tor M. Aamodt, and William J. Dally. The ODBP
work focused on building an energy-efficient branch prediction mechanism for out-of-order
processors that eliminates unnecessary accesses to the branch prediction
unit, reducing its energy consumption and improving its prediction accuracy.
1.2 Thesis Contributions
The main contribution of this work is the design of the energy-efficient coarse-grain
out-of-order architecture, which reaches the performance of the out-of-order execution
model with over 50% reduction in energy consumption.
Given the level of complexity and novelty of this architecture research, new compiler,
simulation, and energy modeling infrastructures have been built. The simulation
framework is targeted toward modeling single-threaded coarse-grain out-of-order,
out-of-order, and in-order processors. It is built on top of the Pintool API [31] and
supports an integrated energy model. The energy model consists of several components
that estimate the energy consumption of different hardware components. It utilizes
SPICE, Verilog, and HotSpot [23] simulations to estimate energy numbers. Additionally,
a compiler back-end is built to produce energy-efficient code with an alternative
Instruction Set Architecture (ISA) named the CG-OoO ISA. The new ISA differs
from the x86 ISA in its additional instruction features for supporting block-level code
processing, and in its register file model. The compiler takes optimized code from gcc
and performs additional optimizations to improve the energy efficiency of the code.
This research project revisits most of the out-of-order processor pipeline stages
and devises a design alternative that makes each stage more energy efficient. The
following list highlights the key topics under which these energy-efficient solutions are
developed and then integrated into the CG-OoO processor model:
• A code-clustering compilation framework
• A block-level branch prediction and fetch model
• A novel instruction scheduling model called Skipahead that supports limited
out-of-order instruction issue
• A distributed register file hierarchy. In this model registers are managed through
a static and dynamic register allocation hybrid. The register file hierarchy also
enables building an energy e�cient register rename unit.
• A new re-order bu↵er design that tracks program order and handles squash
events at block granularity
• A distributed and clustered execution model that enables proportional energy-
performance scaling.
The evaluation framework used for this work is in-house. I have built a simulation
software infrastructure based on the Pintool API which consists of three major
components: a detailed energy model; a functional emulator that runs code on the
native processor and instruments it for later use by the third component of this tool;
and the detailed timing simulator itself. Additionally, this project includes a code
processing component that required developing a dedicated compiler framework to
extend the code optimizations and analysis done by gcc. This compiler reformats the
x86 ISA into an alternative ISA designed to support coarse-grain execution (i.e., the
CG-OoO ISA). The timing simulator uses the CG-OoO ISA for performance evaluations.
To my knowledge, no publicly available simulation model with the attributes and
features of this simulator exists in the computer architecture community. Chapter 5
discusses the details of the compiler and the entire simulator.
1.3 Thesis Organization
The remainder of this thesis is organized as follows. Chapter 2 provides background
information on the execution models and the energy-versus-performance properties of
existing processors. It also explains why an alternative execution model is necessary
to bring substantial energy efficiency to single-threaded, general-purpose processors.
Chapter 3 motivates the coarse-grain out-of-order execution model via several
examples, and describes the main processor architecture features at a high level.
Chapter 4 discusses the CG-OoO architecture features in detail and describes how
the processor functions. Chapter 5 discusses the evaluation methodology of this work;
since a majority of the evaluation infrastructure has been built in-house, a great deal
of discussion is aimed at presenting the details of each building block in the compiler,
the simulation infrastructure, and the energy model. Chapter 6 evaluates the
performance and energy characteristics of the coarse-grain out-of-order model and
compares them against the out-of-order and in-order baseline processors. Chapter 7
analyzes previous work in the literature on performance and energy optimizations
for single-threaded processors and highlights how the CG-OoO model differs from
each. Chapter 8 provides concluding remarks on this thesis.
Chapter 2
Background
This chapter introduces two common classes of processors: in-order and out-of-order.
At a high level, it introduces the design elements of the out-of-order processor that
contribute to its superior performance compared to the in-order processor, and
describes the sources of energy inefficiency associated with the out-of-order processor.
2.1 OoO Execution Model
In-order (InO) and out-of-order (OoO) processors are among the popular single-
threaded execution models in the computer architecture community. In the InO
execution model, instructions are simply executed in program order. In the OoO
execution model, however, instructions can be executed out of the original program
order while maintaining program correctness.
Figure 2.1a shows a simple dynamic code sequence consisting of eight instructions.
Figure 2.1b shows the corresponding data-dependency between the instructions where
circles are the instructions and edges are the data-dependency links between the
instructions. In this example, instruction 2 can only be executed after instruction
1 completes its execution. On the other hand, instruction 3 is free to be executed
anytime before or after instruction 1 as it has no data-dependency on 1. The numbers
on the edges indicate the cycle count for operations to generate their results.
In the case of the InO model, the program will be executed according to the
original program sequence as shown in Figure 2.1c; since instruction 1 takes three
cycles to complete its execution, two stall cycles are introduced. The same situation
adds an additional 2-cycle delay after the issue of instruction 3. In the case of the OoO
model, the processor dynamically tracks data dependencies between instructions; it
identifies instruction 3 as independent of instruction 1 and issues 3
at cycle 3 (see Figure 2.1d). Executing 3 early eliminates three stall cycles from the
execution flow, leading to a 23% reduction in execution time.
Scheduling instructions out of order is known as dynamic instruction scheduling.
To enable dynamic scheduling, the OoO processor leverages program speculation
to issue future instructions earlier. For instance, to fetch instruction 1, the OoO
processor first speculates, through control instruction 0, that instruction 1 will very
likely be executed. It also determines that 3 has no data dependency on other
in-flight operations (i.e., instructions 0 and 1). Finally, to hide the latency of
instruction 1, the processor issues instruction 3 in cycle 3. It is through the combination
of dynamic instruction scheduling and speculation that the OoO processor achieves
superior performance compared to the InO processor.
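The qualitative gap between the two schedules can be reproduced with a toy scheduler. The sketch below is illustrative only: it models a 1-wide machine with unlimited execution units, and it uses a small hypothetical five-instruction dependency graph (not the exact instructions of Figure 2.1) in which a 3-cycle producer stalls the in-order machine but not the out-of-order one.

```python
# Toy 1-wide scheduler comparing in-order (InO) and out-of-order (OoO)
# issue for a data-dependency graph. Latencies and edges are illustrative.

def schedule(latency, deps, in_order):
    """Return {instruction: issue_cycle} for a 1-wide machine."""
    issue = {}
    pending = list(latency)        # instructions in program order
    cycle = 1
    while pending:
        picked = None
        for i in pending:
            # An instruction is ready once every producer's result is
            # available: producer issue cycle + producer latency.
            if all(d in issue and issue[d] + latency[d] <= cycle
                   for d in deps.get(i, [])):
                picked = i
                break
            if in_order:
                break              # InO stalls on the oldest instruction
        if picked is not None:
            issue[picked] = cycle
            pending.remove(picked)
        cycle += 1
    return issue

# Hypothetical program: 1 and 3 are 3-cycle loads; 2, 4, and 5 are
# 1-cycle ALU ops; 5 consumes the results of both 2 and 4.
lat  = {1: 3, 2: 1, 3: 3, 4: 1, 5: 1}
deps = {2: [1], 4: [3], 5: [2, 4]}
ino  = schedule(lat, deps, in_order=True)
ooo  = schedule(lat, deps, in_order=False)
# OoO issues the independent load 3 during the stall cycles of load 1.
```

Under these assumed latencies, the out-of-order schedule completes in 7 cycles versus 10 for the in-order schedule, mirroring the kind of stall elimination shown in Figure 2.1.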
McFarlin et al. [34] characterize the benefits of the OoO processor with respect to
the InO processor and conclude that 88% of the OoO performance gain comes from
speculation, 10% from dynamic scheduling, and 2% from improved branch
mis-prediction recovery. Figure 2.2a depicts this performance breakdown for the
SPEC Int 2006 benchmarks. Several fundamental mechanisms underlie the great
performance advantage of the out-of-order processor: effective false-dependency
elimination via register renaming, aggressive control speculation, accurate memory
speculation to enable memory-level parallelism (MLP), and efficient wake-up/select
logic to issue ready instructions. These features are enabled by a number of key
hardware units, including the branch predictor, register renaming tables, load-store
queue unit, memory disambiguation tables, instruction queue or reservation stations,
and the re-order buffer; Figure 2.3 illustrates these units for a 4-wide superscalar
out-of-order processor pipeline. Accesses to tables within these units are the main
source of the energy overhead associated with the OoO processor. Figure 2.2b shows
Figure 2.1: (a) A simple dynamic code sequence example consisting of nine assembly instructions. (b) The data-dependency graph corresponding to this instruction sequence. The color codes separate the data-dependent subsets. The numbers on the edges indicate the number of cycles for each operation to generate its result. The dotted gray arrows show the control flow for control instruction 0. (c) The execution schedule of instructions in a 1-wide in-order processor. (d) The execution schedule of instructions in a 1-wide out-of-order processor. In this example, the out-of-order schedule is three cycles faster than the in-order schedule.
Figure 2.2: (a) The speedup of the out-of-order (OoO) processor compared to the in-order processor, broken into three key categories [34]. (b) Harmonic mean energy per cycle (EPC) of the SPEC Int 2006 benchmarks simulated on the in-order and out-of-order processors. The OoO energy overhead is divided into three main categories.
this energy overhead is broken into three categories: accesses to large tables,
unnecessarily frequent table accesses, and dynamic instruction scheduling, which is primarily
associated with accessing the instruction window tables.
Here, I qualitatively expand on these energy-consuming units. In Chapter 6, their
energy and performance profiles are discussed and quantified in detail.
The branch predictor tables enable program speculation. They allow the processor
front-end to run ahead of the back-end by fetching future instructions early. To
guarantee high back-end performance, the front-end is designed to avoid fetch stall
cycles by predicting the next fetch group¹ every cycle, irrespective of whether the
current fetch group holds a control operation. As will be discussed in later chapters,
the OoO model spends an excessive amount of energy on program speculation, much
of which can be avoided by accessing the branch predictor only at control operations.
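A back-of-the-envelope count illustrates the opportunity. All parameters below are illustrative assumptions (a 4-wide fetch group and 15% control instructions), not measurements from this thesis:

```python
# Toy branch-prediction-unit (BPU) access count: predicting every fetch
# group vs. predicting only at control operations. All parameters are
# illustrative assumptions, not measured values.

instructions = 1_000_000   # assumed dynamic instruction count
fetch_width  = 4           # assumed fetch-group size
branch_ratio = 0.15        # assumed fraction of control instructions

# Conventional front-end: one BPU lookup per fetch group, every cycle,
# whether or not the group contains a branch.
per_cycle_accesses = instructions // fetch_width

# Alternative: access the predictor only for control operations,
# i.e., at most one lookup per branch instruction.
per_branch_accesses = int(instructions * branch_ratio)

savings = 1 - per_branch_accesses / per_cycle_accesses
```

With these assumed parameters the branch-only policy performs 150,000 lookups instead of 250,000, a 40% reduction; the real saving depends on the workload's branch density and the fetch width.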
The register renaming stage is essential in eliminating runtime write-after-write
(WAW) and write-after-read (WAR) dependencies. By eliminating these false dependencies,
register renaming allows significantly higher instruction level parallelism (ILP). The
OoO processor renames every register operand for every instruction; this results
1The group of instructions fetched via an instruction-cache access is called a fetch group. For example, the fetch group of a 4-wide instruction-cache can contain up to 4 instructions.
[Figure 2.3 diagram: Fetch PC and branch prediction feed the L1 instruction cache, followed by decode, rename, and dispatch into the ROB and instruction window; the scheduler issues from the window to the register file, four EUs, and the LSU, which accesses the L1 data cache backed by the L2 cache.]
Figure 2.3: The pipeline structure for a 4-wide out-of-order execution model. LSU refers to the load-store unit, EU refers to the execution unit, and ROB refers to the re-order buffer.
in a significant energy overhead. In future chapters, I show that for some register operands, register renaming can be eliminated to reduce the energy overhead of renaming table lookups.
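As an illustrative sketch (not the renaming hardware modeled in this work), the following shows the core idea of a rename map: every destination is given a fresh physical register, which removes WAW and WAR hazards. All names here are hypothetical; a real renamer also manages free-list reclamation and branch checkpoints.

```python
# Hypothetical sketch of a register-rename map: every destination gets a
# fresh physical register, eliminating WAW/WAR dependences between writes
# to the same architectural register.

class Renamer:
    def __init__(self, num_phys):
        self.map = {}                      # architectural -> physical mapping
        self.free = list(range(num_phys))  # free physical registers

    def rename(self, dst, srcs):
        srcs_p = [self.map.get(s, s) for s in srcs]  # read current mappings
        dst_p = self.free.pop(0)                     # fresh tag kills WAW/WAR
        self.map[dst] = dst_p
        return dst_p, srcs_p

r = Renamer(64)
d1, _ = r.rename('r1', ['r2', 'r3'])  # first write to r1
d2, _ = r.rename('r1', ['r4', 'r5'])  # second write to r1 (WAW removed)
assert d1 != d2                       # the two writes no longer conflict
```

Because each write to `r1` receives a distinct physical tag, a later writer never has to wait for an earlier writer or reader of the same architectural register.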
The register file is one of the most commonly accessed tables in the processor.
Each instruction accesses the register file 2 or 3 times depending on the number of its
operands.2 Also, superscalar processors issue multiple instructions per cycle requiring
a large number of ports to access data for all instructions.3 In addition, the register
file is usually 4x larger than the number of architectural registers in order to support
register renaming. The larger the size of the register file and the number of ports, the higher the access energy. In future chapters, an energy-efficient register file hierarchy
model is presented to reduce the register file energy.
The instruction scheduler is the major energy consuming component of the core
pipeline. OoO instruction scheduling consists of two main steps: instruction wakeup
and instruction select. Upon the completion of every instruction, its result is written into the register file and forwarded to the operations waiting for it. Upon the
availability of all the source operands of an instruction, it is woken up for issue.
Each cycle, the dynamic instruction scheduler selects the n oldest ready (i.e., woken-up) instructions from the instruction queue and issues them to the available execution units (EU); here, n refers to the number of available EU's in that cycle. The
instruction scheduler is a unified queue with a random access memory (RAM) structure that holds the static instruction information, such as the opcode and immediate value, and two content-addressable memories (CAM) that hold the source operands for each operation; the CAM tables allow the wakeup unit to search and update the
source operands of waiting instructions. In future chapters, an energy-efficient and complexity-effective instruction scheduling model is introduced.
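The wakeup/select mechanism described above can be sketched in a few lines. This is a behavioral illustration only (entry fields and tags are invented for the example, and the CAM search is modeled as a set operation): completing instructions broadcast their destination tags, and each cycle the n oldest ready entries are selected.

```python
# Hypothetical sketch of OoO wakeup/select: completing instructions broadcast
# their destination tags (the CAM match), and each cycle the n oldest ready
# entries are selected for issue.

def wakeup(window, completed_tags):
    for entry in window:
        entry['waiting'] -= set(completed_tags)  # CAM match clears sources

def select(window, n):
    ready = [e for e in window if not e['waiting']]
    ready.sort(key=lambda e: e['age'])           # oldest-first select
    issued = ready[:n]
    for e in issued:
        window.remove(e)
    return [e['tag'] for e in issued]

window = [
    {'tag': 'i1', 'age': 0, 'waiting': set()},
    {'tag': 'i2', 'age': 1, 'waiting': {'i1'}},  # i2 waits on i1's result
    {'tag': 'i3', 'age': 2, 'waiting': set()},
]
assert select(window, 2) == ['i1', 'i3']  # i2 is skipped: not yet woken up
wakeup(window, ['i1'])                    # i1 completes and broadcasts its tag
assert select(window, 2) == ['i2']
```

Note that every broadcast touches every window entry; this all-entries search is precisely the CAM cost that the text identifies as a major source of scheduling energy.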
The re-order buffer (ROB) enables precise exceptions and maintains program order by enforcing in-order instruction commit. The ROB must be accessed by every
dynamic instruction at least three times; once at the rename stage to reserve a ROB
entry, once when the operation completes execution, and once when the operation
2For example, in the MIPS ISA, instructions have at most two source operands and one destination operand.
3In 4-wide OoO processors, building register files with 8 read ports and 4 write ports is common.
is to be committed.4 In future chapters, I discuss an alternative design structure for the ROB that supports program order and precise exceptions without the need to access it as frequently as the OoO model does; reducing the ROB access frequency reduces its energy consumption.
The goal of the rest of this study is to identify the contribution of different tables to the OoO energy overhead and to devise compiler techniques as well as architectural modifications to reduce these energy overheads. As will be discussed in future chapters, reducing these energy overheads demands introducing an alternative execution
model compared to that of the OoO processor. It is expected that the new execution
model maintains support for dynamic execution and speculation, but proposes design
solutions for reducing the energy cost.
2.2 Chapter Summary
This chapter introduced the fundamental architectural elements of the OoO processor that contribute to its performance efficiency, namely dynamic instruction scheduling and program speculation. It described how these features are enabled by the branch predictor, register renamer, dynamic instruction scheduler, and re-order buffer, and explained that these same units consume the majority of the OoO energy overhead. The goal of this work is to devise compiler and architectural solutions that substantially reduce the energy consumption of these units while maintaining the same level of performance as the OoO model.
4Here, I assume the ROB does not hold intermediate register data (i.e., using a physical register file).
Chapter 3
Coarse-Grain Out-of-Order
Execution Model
In this chapter, I introduce an energy-efficient and high-performance execution model
named Coarse-Grain Out-of-Order (CG-OoO) execution and describe it through a
number of examples. I also describe the sources of energy saving in CG-OoO. This
chapter builds the foundation for detailed discussions on the processor architecture
and its performance analysis in Chapters 4 and 6.
3.1 Coarse-Grain Out-of-Order Execution
I present an energy-efficient and high-performance single-threaded core architecture for general-purpose computing named Coarse-Grain Out-of-Order (CG-OoO). The key insight behind building this architecture is block-level dynamic execution. Block-level execution has been previously studied in various contexts [19, 35]. These studies are elaborated on further in Chapter 7.
In this framework, a block is defined as a sequence of static instructions clustered together. Each code block, in this study, is a control-flow basic-block. Block-level dynamic execution means the branch prediction, dispatch, instruction scheduler, operand write-back, commit, and squash units are designed to handle blocks as the primary unit of execution (rather than instructions). In this model, the processor
CHAPTER 3. COARSE-GRAIN OUT-OF-ORDER EXECUTION MODEL 13
speculatively fetches code blocks into the execution pipeline, and allocates to each
block a separate first-in-first-out (FIFO) instruction queue called a Block Window
(BW) (Figure 3.1). The instruction scheduler checks the head of each BW in a
round-robin manner to find ready instructions to issue. Once all instructions in a
code block complete execution, the block is ready to retire.
This architecture motivates a new design model that is substantially more energy efficient, and that sacrifices negligible performance compared to the OoO core discussed in Chapter 2. The following list summarizes the high-level design techniques used in CG-OoO to save energy. It also outlines solutions to the list of energy inefficiency drawbacks of the OoO design discussed in Chapter 2. The remainder of
this chapter and the future chapters focus on explaining the following techniques and
evaluating their impact on the overall energy and performance of CG-OoO.
• Small Tables: CG-OoO replaces large and centralized hardware structures such as the instruction queue, register file, and re-order buffer (ROB) with smaller, less complex, and distributed structures. For instance, as shown in Figure 3.1, the OoO core Instruction Window is replaced with the BW's as decentralized FIFO buffers that hold block instructions. CG-OoO replaces the conventional Re-Order Buffer (ROB) with a 10× smaller table called the Block Re-Order Buffer (BROB) that tracks code blocks (see Table 5.2). CG-OoO also uses a novel decentralized register file hierarchy that is discussed in detail in Chapter 4. By reducing table sizes, the energy to access these tables is reduced.
• Hybrid Instruction Scheduling: CG-OoO combines static and dynamic instruction scheduling to reduce hardware energy consumption by cutting the runtime instruction scheduling overhead. The compiler generates optimized static list schedules for instructions within each block. As illustrated in Figure 3.1, the dynamic scheduler scans the heads of its BW's to find and issue ready instructions.
• Reduced Table Accesses: CG-OoO reduces accesses to energy-hungry hardware units such as the register renamer and branch predictor. Compiler support
Figure 3.1: Block-level dynamic execution model. BW stands for Block Window, and EU stands for Execution Unit. A BW holds operations that belong to a block of code. EU's and BW's are grouped together to form an execution cluster; each execution cluster is managed by a separate scheduler. Each instruction scheduler checks the head of its BW's to issue ready instructions to its EU's.
along with an energy-effective register file hierarchy helps bypass renaming the register operands that have short live-ranges; this technique is discussed in detail in Chapter 4. Compiler support also helps eliminate branch prediction lookups by non-control operations. Section 3.3.2 describes how the CG-OoO core minimizes the branch prediction lookup traffic.
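The block-level issue policy behind these techniques can be sketched compactly. The following is an illustrative model only (queue contents, readiness, and EU counts are invented for the example): each BW is a FIFO, and a round-robin scheduler examines only the BW heads, issuing at most one ready instruction per BW per cycle up to the number of EUs.

```python
# Hypothetical sketch of CG-OoO block-level issue: each Block Window (BW) is
# a FIFO, and a round-robin scheduler examines only BW heads, issuing at most
# one ready instruction per BW per cycle (bounded by the number of EUs).

from collections import deque

def issue_cycle(bws, ready, num_eus):
    """bws: list of deques of instruction names; ready: set of ready names."""
    issued = []
    for bw in bws:                    # round-robin over block windows
        if len(issued) == num_eus:
            break
        if bw and bw[0] in ready:     # only the head of each FIFO is visible
            issued.append(bw.popleft())
    return issued

bw0 = deque(['add', 'sll', 'lw'])
bw1 = deque(['lw2', 'bne2'])
assert issue_cycle([bw0, bw1], ready={'add', 'lw2'}, num_eus=2) == ['add', 'lw2']
assert issue_cycle([bw0, bw1], ready={'lw2'}, num_eus=2) == []  # both heads stalled
```

Because the scheduler inspects only one entry per BW, it avoids the all-entries CAM search of a unified instruction window; the trade-off is that a stalled head blocks the rest of its block, which Section 3.5 addresses with static list scheduling.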
3.2 Constructing Blocks for CG-OoO
In the CG-OoO model, dynamic scheduling is done at code block granularity (rather
than instruction granularity) where each code block consists of a group of instructions
with an optimized static schedule. Block-level dynamic scheduling requires special
support from the compiler to cluster instructions. In this section I explain how the
compiler clusters instructions into code blocks. Recall that each code block, in this
work, is a control-flow basic-block.
3.2.1 Block Boundary Annotation
CG-OoO requires a means to identify blocks of instructions. The compiler generates
a special instruction called head to specify the start of each code block. Upon each
head fetch, the front-end allocates an available BW for the new block of code and
groups instructions into code blocks by steering all upcoming instructions to the BW.
Figure 3.2 shows the head instruction format. Its fields are:
• Opcode: 6-bit opcode value
• HasCtrl: 1-bit value indicating if the code block holds a control instruction
• BlkSize: 5-bit value indicating the number of operations in the block
• Immediate: 52 least significant PC bits of the control instruction in each code
block
Figure 3.3 shows an example code sequence highlighting head: the number of instructions, excluding head, is six; HasCtrl = 0'b1 indicates that a control operation
[Figure 3.2 diagram: the 64-bit head instruction word, with Opcode in bits 63–58, HasCtrl in bit 57, BlkSize in bits 56–52, and the Fall-Through Block Offset in bits 51–0.]
Figure 3.2: The head instruction format. HasCtrl is a 1-bit value indicating if a valid value exists in the immediate field of the instruction. BlkSize holds the number of static instructions in the block, excluding head. The immediate field is the least significant 52 bits of the control operation address at the end of the block. The immediate is used to hash into the BPU tables.
exists at the end of the block, and the immediate holds the least significant 52 bits of the control operation address (i.e., that of bne).
If a block does not end with a control instruction (i.e., it has only a fall-through path), HasCtrl is set to 0'b0 to disable the Branch Prediction Unit (BPU) lookup and instead continue fetching the fall-through block, which is stored immediately after this block.
The head instruction serves a few key purposes:
1. It specifies the boundaries of code blocks to the processor at runtime.
2. It holds the number of block instructions in BlkSize. As will be discussed in
Section 3.3, this number is used to track when all instructions of the block complete their execution, making the block ready to retire. Five bits are allocated to BlkSize because the compiler assumes the largest code block can be 32 instructions. The compiler breaks larger blocks into multiple smaller blocks, each with at most 32 operations. Bird et al. [5] show that the average sizes of basic-blocks in the SPEC CPU 2006 integer and floating-point benchmarks are 5 and 17 operations respectively, making 32 instructions per block a sufficiently large size.
3. The Immediate field is used to hash into the BPU tables during the prediction
stage to predict the next code block. The motivation behind looking up the
BPU using head rather than control operations is described in Section 3.3.
int index = -1;
do {
    index += 1;
    value_array[index] -= 1;
} while (index != MAX_ITER);

LOOP:
0xF00  head  0b'1, 0x6, 0x38
0xF08  add   r3, r3, #1
0xF10  sll   r0, r3, #3
0xF18  lw    r1, r0
0xF20  sub   r1, r1, #1
0xF28  sw    r0, r1
0xF30  bne   r2, r3, LOOP
Figure 3.3: A simple do-while loop that updates the values of an array (left). The assembly version of the program (right) shows the head instruction as the first instruction in the basic-block.
3.2.2 Code Generation
To build the CG-OoO processor architecture, two code generation approaches are
possible. The first approach is to embed code blocking semantics into the program
binary during the static compilation process. The second approach is to construct
code blocks through runtime dynamic code optimization.
In the case of static code optimization, which to date has been the common method, the addition of head to any given Instruction Set Architecture (ISA) is the only necessary change; this may be an acceptable change for an energy-aware architecture given the amount of energy-saving opportunity it can deliver.
In the case of hardware-level dynamic code optimization, for architectures like the NVIDIA Project Denver [1, 6, 12, 15], no compiler-level ISA modification is required; this is because code block detection and annotation can be done dynamically
during the dynamic program profiling stage with almost no extra data-collection cost
as the hardware profiler simply annotates blocks using the information it already
collects on control operations. Such processors dynamically post-process the original
program ISA (e.g. ARM or X86) into a low-level micro-code ISA; the micro-codes
are scheduled into dynamic code sequences that execute with substantially higher
performance and energy efficiency than conventional OoO processors.
The choice between the above two alternatives, in practice, depends on the requirements and constraints of a particular processor architecture design. In this work, I use the former alternative.
3.3 Execution Model
This section discusses the execution model of CG-OoO when instructions are grouped
into basic-blocks.
Figure 3.4.A shows the control flow graph of a simple do-while loop that walks through an array, values, to find the first occurrence of a specific value, ELEM_VALU.
At runtime, the loop is unrolled many times to expose the instruction and data-level
parallelism of the loop to the hardware.
Figure 3.4.B illustrates the processor pipeline stages. The highlighted stages are the key differences with respect to the conventional OoO processor model. These
stages are described in the example provided in Section 3.3.1.
Figure 3.4.C shows the high-level model of a two-wide CG-OoO core. The CG-OoO processor front-end unrolls multiple loop iterations of the program by speculatively fetching, decoding, and register renaming instructions. In parallel with the instruction renaming stage, upon reading a head operation in the fetch sequence, the Block Allocator unit allocates an available Block Window (BW) to store the upcoming instructions; in this figure, the two Block Windows are marked BW0 and BW1. It also reserves a new block entry in the Block Re-Order Buffer (BROB), which is used to maintain program order at block-level granularity. The Instruction Steer stage dispatches upcoming instructions to the appropriate BW. The Instruction Scheduler unit visits the head of each BW to issue ready instructions to an available execution unit (EU). Once instructions complete execution, their results are written back into a register or a store-queue entry, and at the same time, an energy-effective wakeup unit updates each BW with the most recent changes in the program context. Finally, the commit stage retires a code block once (a) it reaches the head of the BROB, and (b) all its operations complete the write-back stage.
3.3.1 Program Execution Flow in CG-OoO
Figure 3.5 is a cycle-by-cycle example of the instruction flow through the CG-OoO pipeline stages. This figure considers two consecutive iterations of the abovementioned do-while loop. Similar to Figure 3.4.C, the processor model in this example
[Figure 3.4 content. Source loop:
    int index = -1;
    do {
        index += 1;
    } while (values[index] != ELEM_VALU);
Assembly for one loop block:
    HEAD:
    0xF00  head  0b'1, 0x4, 0x28
    0xF08  add   r3, r3, #1
    0xF10  sll   r0, r3, #3
    0xF18  lw    r1, r0
    0xF20  bne   r2, r1, HEAD
Pipeline stages (panel B): BLOCK PREDICTION, FETCH, DECODE, RENAME, BLOCK ALLOCATION, INSTRUCTION STEER, EXECUTE, WRITE BACK, BLOCK COMMIT.]
Figure 3.4: (A) control flow graph of a simple do-while loop, (B) the pipeline stages of CG-OoO; the highlighted blocks show the key differences with respect to the OoO core, (C) the high-level execution flow stages in the CG-OoO model with an issue width of 2 instructions / cycle, one per execution unit (EU). Each BW, in this architecture example, is assumed to have two write and one read ports. Also, it has one execution cluster (see Figure 3.1 for details).
[Figure 3.5 table: cycle-by-cycle contents of the BR PRED, FETCH, DECODE, RENAME, DISPATCH, EXECUTE, WB, and COMMIT stages, and of BW0, BW1, and the BROB, over cycles 1–16.]
Figure 3.5: Instruction flow diagram for the example loop provided in Figure 3.4.A. This figure shows the flow of instructions in two consecutive iterations of the loop. Instructions in the first and second iterations are colored green and red respectively. The green table (right) shows the contents of BW0, BW1, and BROB at each cycle. head1 and head2 correspond to the BROB entries for the two loop iterations. WB stands for the write-back stage.
is a two-wide superscalar machine with the ability to issue one instruction per BW
per cycle to the two EU’s. Instructions in the first and second iterations are colored
green and red respectively.
In this example, all instructions but lw are assumed to be single-cycle operations, while lw is a four-cycle operation. As a result, because bne is data-dependent on lw, its issue is delayed by four cycles, until the load value is returned.
In cycle 1, {head, add} instructions are fetched from the instruction cache. The
immediate value in head is used to predict the next code block in cycle 2 after it
is fetched. Notice head speculates the next code block before the control operation,
bne, is fetched on cycle 3. head flows through the pipeline stages until it reaches
the Rename stage in cycle 3 at which point the Block Allocator assigns BW0 to the
instructions following head. At the same time, head reserves an entry in the Block
Re-Order Bu↵er (BROB) where it holds the status of the code block as its instructions
make progress through the pipeline (see head1 in the BROB). head1 is available to be
retired as soon as all instructions in the associated block complete their execution and write back their results to either a register or a store-queue entry. The same sequence
of events applies to the next loop iteration. The first block retires in cycle 13 and the
second one retires in cycle 16. The block speculation model in CG-OoO makes all
computations speculative. Upon retiring a block, all registers and store-queue data
generated by the block operations will be marked non-speculative.
The green table in Figure 3.5 shows the contents of BW0, BW1, and BROB over
time. For example, BW0 receives its first instruction in cycle 4 which is immediately
issued in cycle 5; more instructions from the same code block join BW0 in cycle 5.
The last instruction of the first loop iteration leaves BW0 in cycle 10. BW0 and BW1 become
available to hold new code blocks in cycles 11 and 14 respectively. In cycles 12 and
13, no instruction is available to be issued because BW0 is empty and BW1 is stalling
on the lw instruction.
3.3.2 Control Speculation in CG-OoO
Control speculation allows processors to initiate the fetch of future instructions before
having completed the execution of current control instructions in the pipeline. Control
speculation improves processor front-end performance by avoiding fetch stall cycles.
The Branch Prediction Unit (BPU) is in charge of this task. OoO processors perform
BPU lookups immediately before every instruction fetch to avoid fetch stall cycles
irrespective of the instruction types in the fetch group [48]; this leads to excessive energy cost for control speculation and redundant BPU lookup traffic from non-control instructions, which in turn may lower prediction accuracy due to aliasing [37].
As pointed out earlier, in CG-OoO, head is the only instruction used to access the BPU. Since head is usually ahead of its branch operation by at least one cycle, the probability of fetch stall cycles is often low. This probability, however, depends on the fetch-width of the processor and the common size of code blocks in the application program. For example, in Figure 3.6, when the fetch-width is 2, head and bne are fetched in two consecutive cycles. In contrast, when the fetch-width is 4, head and bne are fetched in the same cycle, which in turn causes a fetch stall cycle due to delayed prediction. As a result, the two processors have equal front-end performance. In Chapter 6, the effect of fetch stalls due to delayed branch prediction lookup is evaluated.
Unlike BPU lookups by head operations, updates to either the branch predictor
or branch target buffer (BTB) during the WB stage use the control instructions.
[Figure 3.6 content. Loop:
    LOOP:
    0xF00  head  0b'1, 0x6, 0xF18
    0xF08  add   r3, r7, #1
    0xF10  sll   r0, r3, #3
    0xF18  bne   r2, r0, LOOP
Two cycle-by-cycle tables show the BR PRED, FETCH, and DECODE stage contents over cycles 1–5 for fetch-widths of 2 and 4.]
Figure 3.6: A simple loop (left). Font colors represent two consecutive iterations of the loop. The cycle-by-cycle instruction flow through the BPU, FETCH, DECODE pipeline stages (right); the top and bottom tables assume a CG-OoO processor with fetch-width = 2 and fetch-width = 4 respectively.
3.4 Squash Model
CG-OoO supports control and memory speculation. The squash process for the two cases is slightly different. Here, the two squash models are discussed separately.
3.4.1 Squash due to Control Mis-speculation
Squash events are handled at block-level granularity. Upon detecting a branch mis-prediction, when the branch is in the execution pipeline stage, the front-end stalls fetching new instructions, all code blocks younger than the mis-speculated control operation are flushed from the pipeline, and the remaining code blocks are retired. Once the BROB is empty, the processor state is non-speculative and the effects of wrong-path operations are discarded. At this stage, the processor can safely resume normal execution by fetching new blocks from the instruction cache. For example, in Figure 3.7, cycle 14 is when the processor can restart fetch.
Figure 3.7 shows the squash process through an example that illustrates the same code sequence as Figure 3.5, except that in this case the second iteration is assumed to be mis-predicted by head in cycle 2. In cycle 11, bne is executed,
[Figure 3.7 table: the same cycle-by-cycle pipeline and BW/BROB contents as Figure 3.5, with the second iteration's entries struck through after the mis-speculation is detected.]
Figure 3.7: Instruction flow diagram for the example loop provided in Figure 3.4, shown in the case of a squash event. This figure shows the flow of instructions in the final iteration of the loop. Instructions in the two iterations are colored green and red respectively. Due to a branch mis-speculation, the final loop iteration is squashed, and the entries with a strikethrough are never executed. The gray box is the time at which the branch mis-speculation is detected.
and in cycle 12 its result is compared against the speculated block program-counter
(BPC). In case of a conflict, a squash flag is raised to flush all mis-predicted, in-
flight operations from the second loop iteration and to remove the head2 entry from
BROB. The squash event also cancels all future activity by younger instructions (see
the strikethrough operations in Figure 3.7).
As noted earlier, in the write-back stage, instructions write their speculative results into a register (or a store-queue) entry. These results are marked non-speculative only after their corresponding block retires. Thus, through the above execution process, the data produced by wrong-path blocks are automatically discarded, as such blocks never retire. The values produced by add and sll in the WB stage in cycles 9 and 10 of Figure 3.7 are discarded.
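The block-granularity flush can be sketched as a partition of the BROB. The snippet below is an illustrative model (block names and the list representation are invented): on a mis-prediction inside a given block, every BROB entry younger than that block is flushed, and only blocks that eventually retire make their results architectural.

```python
# Hypothetical sketch of block-granularity squash: on a branch mis-prediction
# inside block `bad_block`, all BROB entries younger than that block are
# flushed; the block holding the branch, and all older blocks, still retire.

def squash_younger(brob, bad_block):
    """brob: list of block ids in program order (oldest first)."""
    idx = brob.index(bad_block)
    survivors = brob[:idx + 1]   # mis-predicting block and older blocks retire
    flushed = brob[idx + 1:]     # younger (wrong-path) blocks are discarded
    return survivors, flushed

brob = ['head1', 'head2', 'head3']
survivors, flushed = squash_younger(brob, 'head1')  # mis-predict inside head1
assert survivors == ['head1']
assert flushed == ['head2', 'head3']
```

Because wrong-path blocks never retire, their register and store-queue results are never marked non-speculative, which matches the discard behavior described above.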
3.4.2 Squash due to Memory Mis-speculation
Similar to the conventional OoO processors, the memory interface in CG-OoO consists
of a load-store-queue (LSQ) that operates at instruction granularity. The memory
mis-speculation detection follows the conventional model where once a store operation
detects a conflict with a younger load operation, a squash event is triggered. Handling
memory mis-prediction events, however, differs from branch mis-speculation in that the squash process is initiated at the start of the block holding the mis-predicted lw operation, meaning all younger blocks, including the block the lw is part of, are flushed. As a result, the older instructions that coexist in the same block as the lw are also squashed. Since those instructions are older than the lw, they would not have been flushed in the OoO model. The flush of useful operations is called wasted computation. To reduce wasted computation, the compiler schedules lw operations as close as possible to the top of the code block.
3.5 Static Instruction Scheduling
As shown in Figure 3.4, the Instruction Scheduler unit visits the head of each BW to find ready instructions to issue. The limited view of the dynamic scheduler into a BW can prevent it from finding ready instructions that are blocked behind long-latency, memory-dependent operations at the head of a BW. This problem can be mitigated if each code block holds an optimized code sequence that avoids poor dynamic scheduling due to head-of-queue stalls.
In this work, I use static instruction list scheduling on each code block to enable
significant improvements in the processor performance (a) by optimizing the static
schedule along the critical path instructions in each block, (b) by improving memory
level parallelism via hoisting memory operations as close as possible to the top of
their code block, and (c) by minimizing wasted computation due to memory mis-
speculation. The impact of instruction scheduling on performance of CG-OoO is
discussed in detail in Chapter 6.
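A minimal sketch of such a per-block list scheduler is shown below. This is not the compiler pass used in this work; it is an illustrative heuristic (the instruction names, kinds, and dependence sets are invented, and dependences are assumed acyclic) that prioritizes loads among ready instructions so they are hoisted toward the block head.

```python
# Hypothetical sketch of per-block static list scheduling that prioritizes
# loads (hoisting them toward the block head for MLP) while honoring data
# dependences. A real compiler would also weigh critical-path length.

def list_schedule(instrs, deps):
    """instrs: {name: kind}; deps: {name: set of names it depends on}.
    Assumes the dependence graph is acyclic."""
    scheduled, remaining = [], dict(instrs)
    while remaining:
        ready = [i for i in remaining if deps.get(i, set()) <= set(scheduled)]
        # prefer loads among ready instructions; otherwise keep program order
        ready.sort(key=lambda i: 0 if remaining[i] == 'load' else 1)
        pick = ready[0]
        scheduled.append(pick)
        del remaining[pick]
    return scheduled

instrs = {'add': 'alu', 'lw': 'load', 'sub': 'alu'}
deps = {'sub': {'add'}}          # sub consumes add's result; lw is independent
assert list_schedule(instrs, deps) == ['lw', 'add', 'sub']
```

The independent load is moved ahead of the ALU chain, so at runtime it reaches the BW head (and hence the memory system) as early as possible, which also shrinks the wasted-computation window on a memory mis-speculation.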
3.6 Sources of Parallelism in CG-OoO
CG-OoO benefits from a hybrid of static and dynamic parallelism opportunities.
Here, I discuss the sources of these opportunities.
Memory Level Parallelism (MLP)
Since memory operations from different BW's can be issued in parallel, CG-OoO supports memory level parallelism. Figure 3.5 shows MLP in cycle 10, when the two lw operations are in flight at once. MLP is especially effective during cache-miss events. To further improve MLP, the compiler statically hoists memory operations toward the head of their block to help the dynamic scheduler issue them earlier.
Block-Level Parallelism (BLP)
Block-level out-of-order execution manifests itself in the form of having multiple BW’s
issuing instructions to hide each other's head-of-queue stall latency. For instance, in Figure 3.7, in cycles 8, 9, and 10, instructions from BW1 hide the latency of the lw from BW0.
Instruction Level Parallelism (ILP)
ILP in the context of CG-OoO execution refers to instruction issue parallelism within
a code block. This type of parallelism is not presented in this chapter and is not included in Figures 3.1 and 3.4. In Chapter 4, I elaborate on two energy-effective techniques that improve CG-OoO performance by allowing instruction-level parallelism. The simplest model for instruction-level parallelism is the case where more than one instruction can be issued from each BW per cycle, in order. For instance, if two consecutive instructions at the head of a BW were ready to issue, the scheduler would issue them both. The more involved model allows limited bypass across head-of-queue stall instructions in order to find stall-independent instructions. Both techniques are designed such that they avoid the energy cost and complexity of dynamic OoO scheduling.
3.7 Chapter Summary
I presented the CG-OoO processor execution model through an example that illustrates how instructions flow through the pipeline, how the squash unit rolls back execution, and how static and dynamic instruction scheduling cooperate to provide high-performance dynamic execution. I also described the sources of energy saving in the CG-OoO processor.
Next, I discuss the architectural details of the CG-OoO processor and how each design element contributes to its overall performance and energy behavior relative to the OoO processor.
Chapter 4
System Architecture
In this chapter, I discuss the architectural features of the CG-OoO processor model and elaborate on how different design decisions contribute to saving energy in each stage.
4.1 System Architecture
All major architectural units in the CG-OoO processor are presented in Figure 4.1. The front-end consists of the Fetch, Decode, Register Rename, Block Allocation, and Instruction Steer units. It processes instructions in-order and clusters them into code blocks that are processed by the back-end. The back-end consists of the Block Windows (BW), Instruction Scheduler, Execution Units, Load-Store Unit (LSU), and Block Re-Order Buffer (BROB). This chapter provides details on each of these units, describes how they communicate with each other, and highlights the design techniques that enable energy saving.
4.2 Pipeline Stages
This section focuses on presenting the micro-architectural details of each stage and
the communication patterns between stages presented in Figure 4.1.
[Figure 4.1 diagram: the front-end (Branch Prediction Unit, 32KB L1 instruction cache, Fetch, Instruction Decode, Register Rename, Block Allocation, Instruction Steer) feeds a back-end of Block Windows (BW), the Instruction Scheduler, Execution Units (EU), the LSU, and the B-ROB, backed by a 32KB L1 data cache and a 256KB L2 cache.]
Figure 4.1: Detailed micro-architecture of the CG-OoO processor. The highlighted blocks are the key differentiators of this processor compared to the OoO processor. The register file hierarchy is encapsulated in the BW modules. See Figure 4.11 for details.
4.2.1 Branch Prediction
Speculative execution enables latency hiding through accurate branch prediction.
This allows the processor front-end to speculatively run ahead of the current program
execution stage and provide dynamic code for the processor back-end to run during
unpredictable, long-latency cache-miss events. As will be shown later in this chapter,
to save speculation energy, the Branch Prediction Unit (BPU) in the CG-OoO processor
limits speculation lookups to one per dynamic code block; this is significantly
more energy efficient than the OoO processor, where a BPU lookup is done at
every instruction fetch event. Chapter 6 quantifies the energy benefits of block-level
speculation.
Figure 4.2 shows the micro-architectural details of the branch prediction stage in
the CG-OoO processor; it consists of the 2Bc-gskew Branch Predictor (BP) [48], the
Branch Target Buffer (BTB), the Return Address Stack (RAS), and the Next Block-PC
block. The next code block PC is computed through Equation 4.1.
PC_next-head = PC_head + fall-through-block-offset    (4.1)
The fall-through-block-offset is held as an immediate field of the head
instruction, as previously shown in Figure 3.2. This value is generated at compile time
to help compute the next head PC.
In contrast with the conventional BPU access approach where every fetch group
PC would be used to access the BPU, in the CG-OoO model, only head PC’s access
the BPU. Upon lookup, a head PC is used to predict the next head PC. Speculated
PC's are pushed into a FIFO queue named the Block PC Buffer, dedicated to
communicating block addresses to the fetch unit. Section 4.2.2.1 presents a detailed
processor front-end runtime example in which the branch predictor behavior is
elaborated.
When control operations complete execution, they verify the prediction correctness;
if the prediction was correct, the corresponding BPU entry(ies) would be reinforced,
and if it was incorrect, it would be reversed. However, since in the prediction
phase the BPU is indexed by head PC's, it is necessary that BPU updates also index
[Figure 4.2 diagram: the head PC indexes the BPU (2Bc-gskew components BIM, G-Share, META, plus the BTB and RAS); the Next Block PC unit combines predictions with the fall-through-block-offset to push next fetch-block addresses into the Block PC Buffer, which feeds instruction cache access and the decode stage; exception and branch-misprediction redirects supply the next fetch address.]
Figure 4.2: Branch Prediction Unit (BPU) micro-architecture. Next Block PC computes the next fall-through block PC. The Block PC Buffer holds the outstanding block addresses the Fetch Stage must consume. In this model, each head PC is used to predict the next head PC. Conditional operations within each block would be used to confirm predictions and, if needed, update table entries.
the table using the same PC’s. To do so, control operations access their corresponding
BPU entry(ies) by first computing their head PC using Equation 4.2.
PC_head = PC_control-op − code-block-offset    (4.2)
The remainder of the branch prediction update process is similar to conventional
CPU models; the global prediction history queue is speculatively updated at
prediction time, and the BP, BTB, and RAS table entries are updated once control
instructions complete execution.
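Equations 4.1 and 4.2 reduce to an add and a subtract on PC values. The sketch below (with hypothetical argument names and byte-valued offsets) shows how prediction and update index the BPU with the same head PC:

```python
def next_head_pc(head_pc, fall_through_block_offset):
    # Equation 4.1: the head's fall-through-block-offset immediate gives
    # the distance from this head to the fall-through block's head.
    return head_pc + fall_through_block_offset

def head_pc_for_update(control_op_pc, code_block_offset):
    # Equation 4.2: a completing control op subtracts its offset within
    # the block to recover the head PC that indexed the BPU at
    # prediction time, so the update hits the same entry.
    return control_op_pc - code_block_offset
```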
Upon a squash event, conventional branch predictors would undo the global
history queue, the only speculatively updated unit in the BPU. In the CG-OoO
architecture, the squash protocol also flushes the Block PC Buffer.
By only accessing the BPU through head operations, the CG-OoO branch predictor
is 53% more energy efficient than that of the OoO model (see Figure 6.12). Recall
from Chapter 2 that the OoO model accesses the BPU on every fetch group.
Chapter 6 evaluates the energy and performance characteristics of the CG-OoO branch
predictor.
4.2.2 Fetch Stage
The Fetch stage accesses the instruction cache to load future instructions for the
processor back-end to execute. Depending on the processor width, Fetch may load
a different number of instructions per cycle (e.g. 2, 4, or 8 instructions). The CG-OoO
Fetch stage loads code blocks from the instruction cache after receiving code block
addresses from the BPU. Code block addresses are pushed to the Block PC Buffer by
the branch predictor and popped by the fetch front-end. When the Block PC Buffer is
full, the BPU stalls and waits for fetch to drain the buffer. When this buffer is empty,
the fetch unit stalls and waits for the BPU to produce new code block addresses.
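This producer/consumer relationship between the BPU and Fetch can be modeled as a bounded FIFO. The class below is an illustrative sketch; the capacity and method names are my own assumptions, not the hardware interface:

```python
from collections import deque

class BlockPCBuffer:
    """Bounded FIFO between the BPU (producer) and Fetch (consumer).
    A full buffer stalls the BPU; an empty one stalls Fetch."""
    def __init__(self, capacity=4):
        self.q, self.capacity = deque(), capacity

    def bpu_push(self, block_pc):
        """Returns False when full, signaling a BPU stall."""
        if len(self.q) == self.capacity:
            return False
        self.q.append(block_pc)
        return True

    def fetch_pop(self):
        """Returns None when empty, signaling a Fetch stall."""
        return self.q.popleft() if self.q else None

    def flush(self):
        """Squash protocol: discard all outstanding block addresses."""
        self.q.clear()
```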
Figure 4.3A illustrates a control flow graph with five basic-blocks. Each block is
marked with its head identifier, h, at the top, and its control operation identifier (if
any), c, at the bottom. Figure 4.3B illustrates the mapping of these basic-blocks to
the instruction cache, where each rectangle represents a 64-bit instruction and each
group of four adjacent instructions with the same color shade represents a fetch group.¹
Instruction entries marked I show the mapping of non-control, non-head operations
within the basic-blocks shown in Figure 4.3A.
In order to correctly fetch code blocks, the fetch unit supports three block fetch
scenarios:
1. Fetch-group with zero head instructions
2. Fetch-group with one head instruction
¹This is assuming that the instruction cache does not support fetch-alignment [32].
3. Fetch-group with more than one head instruction
The mapping of instructions to the cache model in Figure 4.3B includes examples
of all three fetch scenarios listed above. All cache lines holding no head operation
represent case 1; the fetch groups containing h1 and h2 represent case 2; the fetch
group containing {h3, h4} represents case 3.
[Figure 4.3 diagram: (A) a control flow graph of basic-blocks B1–B5 with heads h1–h5 and control operations c1, c2, c4, c5; (B) the blocks' instructions, marked I, laid out across instruction-cache fetch groups.]
Figure 4.3: (A) A control flow graph with five blocks labeled B1 to B5. Each block has a head operation labeled h1 to h5. c1 to c5 correspond to control instruction members of code blocks. (B) Mapping of operations in the control flow graph onto an instruction cache. This figure assumes a 4-wide fetch unit with no fetch alignment support; instruction sequences in the same color shade would be fetched together. I corresponds to non-head and non-control instruction members of code blocks.
The Block PC Buffer can hold either a PC value or a 0x0 value. A PC value
prompts the fetch unit to start the next block fetch from the specified address. A
0x0 value indicates the next-block PC value is unknown. This situation happens
when a head operation, hh, is predicted not-taken. For the predictor to identify the
fall-through block PC (Equation 4.1), it needs access to the
fall-through-block-offset field of hh, which may not have been fetched yet. When the fetch stage
encounters this situation, it assumes the fall-through block is the block immediately
after the hh block in memory. So, it continues fetching the next block while the
fall-through block address for hh is computed. If the fetch unit completes the
next-block fetch before the fall-through block address is computed, fetch stalls until this
information becomes available. The next section describes the mechanism to predict
head PC's via an example.
4.2.2.1 Instruction Fetch Example
Assume the program control flow graph in Figure 4.3A and the 4-wide fetch cache
model in Figure 4.3B. Furthermore, assume a perfect branch predictor produces the
following sequence of dynamic code blocks to be fetched: {B1 → B3 → B4 → B1 → B2
→ B4}. For this fetch example, the sequence of predict, fetch, and decode activities
along with the Block PC Buffer contents are shown in Figure 4.4. This example
assumes the Block PC Buffer initially holds the PC of h1 from past predictions (cycle
0). The instruction cache starts by fetching B1 in cycles 1-2. In cycle 2, the fetch of
B1 completes via detecting h2. In cycle 3, fetch starts at the cache line where h3 is
located. At the end of cycle 3, the fetch of B3 completes via detecting h4. Since B3 is
predicted not-taken, the fetch unit continues fetching the B4 instructions even though
the fall-through destination address (i.e. the h4 PC) is still unknown. The address for h4
would be verified in cycle 4, when the block offset value for h3 is available. In cycle
8, the fetch of B4 completes via detecting h5. In cycle 9, B1 is fetched. Its fetch is
completed in cycle 10, though since B1 falls through to B2, the fetch unit continues its
fetch until it reaches cycle 15. In cycle 15, the fetch for B2 completes via detecting
h3. B4 is the block to be fetched after B2. Because h4 is located in the fetch-group
that was just fetched, the fetch unit continues its fetch for B4.
In cycle 1, h3 is predicted using the h1 PC. In cycle 2, h3 is predicted not-taken,
though its destination address remains unknown until after h3 is fetched. In cycle 4,
the fall-through address becomes available as h4, which replaces the 0x0 entry at the
head of the Block PC Buffer. h4 is used to predict h1 as the next block address. Since
h1 is predicted not-taken, its destination address would be unavailable until cycle 9, when
the fall-through offset of h1 is available after fetching h1. In cycle 10, h2 is computed,
which replaces the 0x0 entry at the head of the Block PC Buffer. In cycle 11, h2 is
CYCLE  BR PRED    FETCH          DECODE         Block PC Buffer
0                                               h1
1      h3         h1, I, I, I                   h3
2      not-taken  I, c1, h2, I   h1, I, I, I    h3, 0x0
3                 c2, h3, I, h4  I, c1, h2, I   0x0
4      h4         I, I, I, I     c2, h3, I, h4  h4
5      h1         I, I, I, I     I, I, I, I     h1
6      not-taken  I, I, I, I     I, I, I, I     h1, 0x0
7                 I, I, I, I     I, I, I, I     h1, 0x0
8                 I, I, c4, h5   I, I, I, I     h1, 0x0
9                 h1, I, I, I    I, I, c4, h5   0x0
10     h2         I, c1, h2, I   h1, I, I, I    h2
11     h4         I, I, I, I     I, c1, h2, I   h4
12     h1         I, I, I, I     I, I, I, I     h4, h1
13     h3         I, I, I, I     I, I, I, I     h4, h1, h3
14     not-taken  I, I, I, I     I, I, I, I     h4, h1, h3, 0x0
15                c2, h3, I, h4  I, I, I, I     h4, h1, h3, 0x0
16                I, I, I, I     c2, h3, I, h4  h1, h3, 0x0
17                I, I, I, I     I, I, I, I     h1, h3, 0x0
18                I, I, I, I     I, I, I, I     h1, h3, 0x0
Figure 4.4: An example code fetch sequence that follows the code example in Figure 4.3. The table shows the sequence of events in the BPU, Fetch, and Decode units over time, together with the contents of the Block PC Buffer.
popped from the Block PC Buffer to verify that the fall-through block being fetched is
valid, and to predict h4 as the next block to fetch. Following the above prediction
mechanism, blocks h1 and h3 are predicted in cycles 13 and 14, respectively. In cycle
14, h3 is predicted not-taken. The BPU then stops predicting more PC's until
the corresponding instance of h3 is fetched.
In summary, the above prediction mechanism predicts the next block using the
current block PC until the point where a block is predicted not-taken. At that point,
the BPU stalls until the fall-through block offset becomes available. In such a case,
the fetch unit continues fetching one more block (i.e. the fall-through block) while
waiting for the Next-Block PC to be computed.
If Inequality 4.3 holds, the fall-through destination address would be known
prior to the fall-through block fetch cycle.
BS ≥ FW × FPD    (4.3)
BS represents the Block Size, FW represents the Fetch Width, and FPD represents
the Fetch Pipeline Depth. When BS is so small that this relationship does not hold,
to avoid fetch stalls due to unknown fall-through addresses, the BPU pushes a 0x0
into the Block PC Buffer. This value notifies the fetch unit that it must continue its
fetch upon detecting a new block, assuming the next observed code block is the fall-through
value. When the fall-through offset value becomes available, the Next Block
PC generator verifies that the expected fall-through address matches the next fetched
block. The compiler also guarantees that fall-through code blocks always follow each other
in the binary. The adversarial case that can stall the fetch pipeline is when
several very small code blocks predict not-taken. Such block sequences, however, are
rare in practice, as shown in Chapter 6.
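As a concrete check on Inequality 4.3 (the fetch parameters below are assumed values for illustration, not the evaluated configuration):

```python
def fall_through_known_in_time(block_size, fetch_width, fetch_pipe_depth):
    # Inequality 4.3: BS >= FW * FPD. The block must cover at least as
    # many instructions as are in flight through the fetch pipeline for
    # the head's fall-through-block-offset to be decoded before the
    # fall-through block's fetch cycle.
    return block_size >= fetch_width * fetch_pipe_depth

# Assumed example: a 4-wide fetch unit with a 2-stage fetch pipeline.
assert fall_through_known_in_time(8, 4, 2)      # large block: no 0x0 needed
assert not fall_through_known_in_time(4, 4, 2)  # small block: BPU pushes 0x0
```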
4.2.2.2 Fetch Alignment
Figure 4.3B shows operation h3 as the second word in its fetch group. When c2
jumps to h4, the fetch unit could fetch instructions {c2, h3, I, h4}. In this case, the
Decode Stage would turn c2 into a NOP. Such operations are highlighted in red in
the Decode column of Figure 4.4. Alternatively, the fetch unit could start fetching
from h3, fetch an additional operation from the next fetch group, align h3 to be the
first operation in the fetched group, and send the four operations {h3, I, h4, I} to
the Decode stage. The Cg-OoO front-end performs the latter fetch alternative to
maximize the number of useful instructions fetched per cycle. Though, for the sake
of simplicity, the example in Figure 4.4 does not support such fetch alignment.
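A minimal sketch of the alignment step, treating the cache as a flat instruction stream (real hardware would merge words from two cache lines):

```python
def align_fetch_group(stream, head_index, width=4):
    """Start the fetch group at the head instruction and borrow words
    from the following line so all `width` slots carry useful work,
    instead of fetching line-aligned and NOP-ing the words before the
    head. Purely illustrative model."""
    return stream[head_index:head_index + width]

# Reproduces the text's example: starting at h3 yields {h3, I, h4, I}.
```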
4.2.2.3 Early head Detection & Forwarding
Figure 4.5 illustrates the head forwarding logic immediately after instruction cache
access. It compares every instruction opcode against the head opcode and if there
is a match, it sends the head’s fall-through-block-offset field to the BPU. By
transferring the head's content before the fetch group is decoded, this simple logic unit
minimizes the delay to generate the fall-through address. Notice that early head detection
assumes a fixed-size bytecode ISA. If an ISA generates a variable-size bytecode, the
fall-through-block-offset would be forwarded to the BPU at the end of the Decode
stage.
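The comparator array reduces to an opcode match per fetched word. In the sketch below, instructions are modeled as (opcode, immediate) pairs and "head" stands in for the real opcode encoding; both are assumptions for illustration:

```python
HEAD_OPCODE = "head"  # stand-in for the actual head opcode bits

def forward_fall_through_offsets(fetch_group):
    """Scan a just-fetched group and return the fall-through-block-offset
    immediates of any head operations, to be forwarded to the BPU before
    the group reaches Decode."""
    return [imm for opcode, imm in fetch_group if opcode == HEAD_OPCODE]
```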
[Figure 4.5 diagram: after the 4-wide instruction cache access, each fetched word's opcode is compared against the head opcode; matches forward their fall-through-block-offset toward the BPU before the group enters Instruction Decode.]
Figure 4.5: A simple logic unit used to detect head operations immediately after fetch. The fall-through-block-offset field of detected head operations would be forwarded to the BPU for Next Block PC computation.
4.2.3 Decode Stage
The instruction decode micro-architecture follows that of the conventional OoO
architecture model. One difference is that in the CG-OoO processor, this stage
distinguishes between global and local register operands by appending a 1-bit flag, called
the Register Rename Flag (RRF), next to each register identifier. For local operands
it sets the value to 0 and for global operands it sets it to 1. The Register Rename
stage looks up this bit to determine which operands must be renamed.
Figure 4.6 shows the RRF field.
As discussed earlier, in Figure 4.4, the operations marked red in the Decode column
are invalid and must be discarded from the execution flow. This stage identifies
invalid operations and turns them into NOP’s. For example, in cycle 16, operations
belonging to h3 would be discarded as B3 is not part of the execution sequence at
that time. While the fetch of some of these operations would be avoided via fetch
alignment, some still remain for the Decode Stage to handle.
4.2.3.1 Fetch vs. Decode Width in CG-OoO
As mentioned in Chapter 3, head operations are used to identify block boundaries
for the Block Allocator Stage. Beyond that stage, head operations have no utility.
As a result, once head reaches the Block Allocation Stage, it is discarded from the
execution pipeline.
To maintain equivalent front-end performance between the OoO and CG-OoO
processors, it is critical that they both dispatch the same number of operations on
average. Because the addition of head operations to the Instruction Set Architecture
(ISA) reduces the effective fetch bandwidth of the front-end, the analysis provided in
Chapter 6 assumes a 6-wide Fetch unit for the CG-OoO processor and a 4-wide Fetch
unit for the OoO processor. However, their Register Rename stages remain 4-wide to
maintain dispatch fairness between the OoO and CG-OoO processors.
4.2.4 Register Rename / Block Allocation Stage
4.2.4.1 Register Rename
As mentioned earlier, the Decode stage identifies and tags local / global register
operands for each instruction. In this stage, if an instruction holds a global register
operand, it accesses the register rename tables to receive its physical register. Oth-
erwise, the instruction would skip the register rename lookup. Figure 4.6 shows the
control logic for a single instruction operand choosing to access or bypass the
Register Rename stage. If the RRF is high, the register must be renamed; otherwise
its statically allocated value is used. Statically managed registers access a register
file structure called the Local Register File. Renamed operands access a register file
structure called the Global Register File. The register file hierarchy in this work is
further discussed in Section 4.2.7.1.
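The per-operand decision reduces to one bit. A sketch of the bypass follows; the tuple encoding and table shape are illustrative, not the hardware format:

```python
def resolve_operand(reg_id, rrf, rename_table):
    """RRF = 1: a global operand; look it up in the rename table and read
    the Global Register File. RRF = 0: a local operand; use the
    compiler-assigned identifier directly against the Block Window's
    Local Register File, skipping the rename lookup (and its energy)."""
    if rrf:
        return ("GRF", rename_table[reg_id])
    return ("LRF", reg_id)
```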
Operands that need to be renamed follow the conventional Merged Rename and
Architectural Register File renaming model in the Alpha 21264, MIPS R12000, and
Pentium IV processors [17, 22, 27]. The only difference is that updating the commit
Register Alias Table (commit-RAT) happens at block granularity, meaning all
global write operands belonging to a committing code block commit together. If
the number of write ports to the commit-RAT is smaller than the total number of
global write operands in a code block, commit of the code block may take multiple
cycles. To support block-commit, each entry in the Block Re-Order Buffer (BROB)
stores its global write register operands. More details on the BROB micro-architecture
are provided in Section 4.2.9.
Skipping the register rename stage reduces the renaming lookup energy by an
average of 30%. This feature is not possible in the OoO core as its execution model
enforces maintaining program data-flow dependency at instruction granularity rather
than block granularity.
4.2.4.2 Block Allocation
After the Decode Stage detects a head operation, the Block Allocator finds an available
Block Window (BW) to host upcoming block operations. If no BW is available to
be allocated, the processor front-end stalls. Figure 4.7 shows the state transition
diagram for the block manager. Each block is initially empty and Available. Once a
Block Window is selected by the allocator to store a code block, it transitions to the
Busy (Fetching) state, where it would continue to receive instructions until the last
instruction in the code block is fetched. As mentioned earlier, the end of a block is
detected via detecting the next head operation in the fetch sequence. Then, the BW
moves to the Busy (Done Fetch) state. It stays in that state until all instructions
[Figure 4.6 diagram: the RRF bit gates whether a register identifier (REG) passes through the Register Rename table to produce a physical register (P-REG) or bypasses renaming entirely.]
Figure 4.6: The register rename bypass logic. When the Register Rename Flag (RRF) bit is high, the register identifier must be renamed. Otherwise, it would skip renaming. The RRF is appended to each register identifier during the Decode Stage. The tri-state driver avoids register rename table lookups for local operands.
held in the BW are issued for execution.
The Block Allocation Stage maintains a FIFO queue of all BW’s in the Available
state. It maintains a register pointer to the BW whose instructions are currently being
fetched. This register will be looked up by the Instruction Steer unit (Section 4.2.5) to
transfer instructions to the appropriate BW destination. In addition, the Block Allocator
allocates an entry in the Block Re-Order Buffer (BROB).
Notice BW's move from Busy states to the Available state before all their
corresponding operations are completed. This implies the BROB size must be larger than
the total number of BW's in the processor to lower the chance of structural hazards
due to BROB-full events.
[Figure 4.7 diagram: Block Window states Available → Busy (Fetching) → Busy (Done Fetch) → Available, with a direct Available → Busy (Done Fetch) edge for blocks fetched in one cycle, and squash edges from both Busy states back to Available.]
Figure 4.7: State transition diagram of the Block Allocator unit. The status of each Block Window is tracked using this state transition diagram. Available BW's can be used to store upcoming code blocks from the processor front-end.
Whenever the code block size is smaller than the processor decode width, the
allocated block transitions directly from the Available to Busy (Done Fetch) state.
Upon a squash event, all BW’s holding instructions younger than the instruction
[Figure 4.8 diagram: a demultiplexer after the Decoder routes head operations to the Block Allocator and all other instructions to the Register Rename unit.]
Figure 4.8: The logic diagram routing head operations to the Block Allocator and all other instructions to the Register Rename unit.
initiating the squash must be flushed. All such blocks flush and reset their internal
context and move to the Available state. Section 4.3 discusses the BW context reset
process.
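The legal transitions of Figure 4.7 can be written down as a small table. The event names below paraphrase the text and are my own labels, not a hardware specification:

```python
# Block Window state machine, per Figure 4.7 (illustrative sketch).
TRANSITIONS = {
    ("AVAILABLE", "start_block_fetch"):    "BUSY_FETCHING",
    ("AVAILABLE", "block_fits_one_group"): "BUSY_DONE_FETCH",
    ("BUSY_FETCHING", "next_head_seen"):   "BUSY_DONE_FETCH",
    ("BUSY_FETCHING", "squash"):           "AVAILABLE",
    ("BUSY_DONE_FETCH", "all_issued"):     "AVAILABLE",
    ("BUSY_DONE_FETCH", "squash"):         "AVAILABLE",
}

def step(state, event):
    """Advance a BW's state; raises KeyError on an illegal transition."""
    return TRANSITIONS[(state, event)]
```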
To shorten the pipeline depth, the Block Allocation and Register Rename stages
are consolidated into a single pipeline stage, given that the types of instructions processed
by the two units are mutually exclusive. Figure 4.8 shows the interface between the
Decode and Register Rename / Block Allocation stages. All head operations in a
decode group are routed to the Block Allocation unit; the rest are routed to the
Register Rename unit. The demultiplexer select line has one bit for each operation
in the decode group.
Figure 4.9 shows the status signals (dotted arrows) traveling from the BW's and
BROB to the Block Allocator. The solid arrow from the Block Allocator to the
Instruction Steer Switch Network sets the destination BW(s) for upcoming instructions
from the rename stage. Also, memory operations allocate an entry in the LSU.
At the same time, the solid arrow from the Block Allocator to the BROB allocates a new
entry in the BROB for the new code block. When a BW completes issuing all its
instructions, it notifies the Block Allocator to switch its state to Available. When
the BROB runs out of empty slots, it notifies the Block Allocator to stop allocating new
blocks. When the Block Allocator receives a BROB-Full message or runs out of Available
BW's, it sends a stall message to the Fetch unit (see the dotted line from the Block
Allocator to Fetch). Likewise, when the LSU runs out of space, it notifies the Fetch
unit to stall.
[Figure 4.9 diagram: the front-end (Fetch, Decoder, Register Renamer, Block Allocator, Instruction Steer Switch Network) feeding BW 0 through BW n-1, with BW status signals plus BROB and LSU occupancy signals flowing into the Block Allocator, and a stall-fetch signal back to the Fetch unit.]
Figure 4.9: CG-OoO front-end highlighting the control signals used between the Block Allocator and other units to communicate the processor resource utilization status. LSU is the Load-Store Unit, BROB is the Block Re-Order Buffer, and BW is the Block Window.
4.2.5 Instruction Steer Stage
This stage consists of a point-to-point interconnection network that directs
instructions to their corresponding BW destination while maintaining the sequential order
between instructions within a fetch group. For fetch groups where instructions
belong to more than one BW (e.g. Figure 4.10D,E), the network steers instructions by
selecting the proper BW for each operation separately.
In parallel with steering instructions to their designated BW, the global write
operand identifier of each instruction is copied to its corresponding BROB entry
(i.e. the tail entry of BROB). These physical register identifiers are later used at
commit time to update their corresponding architectural register state. For more, see
Section 4.2.9.
4.2.6 Front-end Examples
Figure 4.10 shows several independent instruction sequences assuming all instructions
in the same line belong to the same fetch and decode group in a four-wide superscalar
CG-OoO processor front-end. The register operand identifiers starting with G refer
to global register operands (prior to register renaming) and the identifiers starting
with L refer to local register operands. In Figure 4.10A, all operations have global
read and/or write operands. These operands are renamed before they are dispatched
to their BW. Block Allocator detects the head operation in this group and assigns
an Available BW to the new code block. In Figure 4.10B, all operands in the fetch
group are local. As a result, the fetch-group can bypass the Register Rename / Block
Allocation stage and move directly to the Instruction Steer stage. In contrast, in
Figure 4.10C, although all operands are local and bypass the Register Rename stage,
because this fetch group initiates a new code block, instructions wait for the Block
Allocator unit to determine their destination BW. In Figure 4.10D, the first operation
on the left belongs to a di↵erent block than the rest of the fetch group. In this case,
lw L1, Addr bypasses Block Allocation / Register Rename stage and moves to its
destination BW. The rest of the fetch group stalls until the new available BW is
determined. Figure 4.10E shows the case where two heads appear in a fetch-group.
The Block Allocator stage is capable of allocating more than one BW per cycle.
Two Available BW’s are selected from the Available BW’s Queue. Each operation
is individually routed to its destination BW according to the BW ID set for it in
the Instruction Steer stage. Thus, instructions belonging to the two code blocks are
separated and routed into two BW's at the same time. Because the first code
(A) head; lw G1, Addr; add G1, G1, L1; sub L1, G1, #2
(B) lw L2, Addr; lw L1, L2; add L2, L1, L1; sub L2, L2, #2
(C) head; lw L1, Addr; add L2, L1, L1; sub L2, L2, #2
(D) lw L1, Addr; head; add L2, L1, L1; sub L2, L2, #2
(E) head; bne G1, G2, loop; head; add L2, L1, L1
Figure 4.10: Five example fetch and decode instruction groups showing different instruction combinations.
block is smaller than the decode group size, it completes its fetch in one cycle, and
its corresponding BW transitions directly from the Available to Busy (Done Fetch)
state (see Figure 4.7).
4.2.7 Issue Stage
Arguably, the scheduling stage is one of the most challenging units on the OoO
processor critical path, as several events must be handled within a small number of cycles.
In this section, I discuss a complexity-effective and energy-efficient instruction issue
(selection, wakeup) mechanism designed for the CG-OoO processor. Such a scheduler
must be fast and energy efficient in selecting ready operations. It must also support
a fast wakeup mechanism with low energy overhead in activating future ready
operations.
Before focusing on the instruction issue mechanisms, I describe the individual
micro-architectural components of the Issue stage.
4.2.7.1 Issue Micro-architecture
This stage consists of several Block Windows that buffer instructions for execution.
As shown in Figures 3.5 and 4.1, BW's receive operations via the Instruction Steer
Stage, and issue operations via the Instruction Scheduler Unit described later in this
section. Each BW consists of several key components; Figure 4.11 highlights these
components.
Instruction Storage Instruction Queue (IQ) is a FIFO queue that holds code
block instructions. Figure 4.12A shows the fields associated with each IQ table entry.
IQ has a finite size; the compiler splits the static code in case a code block holds
more instructions than this size. Splitting blocks increases block-level parallelism by
turning a large code stream into two separate code streams. However, it also increases
the Global Register File pressure. Thus, choosing the right block size is essential to
delivering high-performance computation while keeping the energy consumption low.
Chapter 6 discusses the energy-performance trade-off for different IQ sizes.
Head Buffer (HB) is a small buffer used to hold instructions waiting to be issued
by the Instruction Scheduler. The HB pulls instructions from the head of the IQ FIFO and
waits for their operands to become ready. Once an operation has all its operands
ready, the Instruction Scheduler removes it from the HB and issues it for execution.
As shown in Section 4.2.7.2, depending on its micro-architecture setup, the HB may hold
between one and four instructions. Figure 4.12B illustrates the logical structure of
each operation when stored in the HB; each source operand maintains a Ready bit,
indicating if its data is available, and a 64-bit field to hold its data. The data fields
would either be populated by a register file read or by the wakeup unit as discussed
in Section 4.2.7.1.
Register File Hierarchy Two types of register files exist in the CG-OoO processor:
a Global Register File (GRF), and several Local Register Files (LRF). The GRF is
managed by the Register Rename unit while LRF’s are managed by the compiler. The
GRF holds registers used to communicate data across BW’s. A LRF, on the other
hand, holds register operands used to communicate data between instructions within
the same code block. Each BW maintains a dedicated LRF to save intermediate data
used among its instructions.
[Figure 4.11 diagram: a Block Window containing the Instruction Queue, the Head Buffer, the LRF, and a GRF segment, connected to shared EU's.]
Figure 4.11: Components of a Block Window (BW): the Instruction Queue (IQ), the Head Buffer (HB), the Local Register File (LRF), and a Global Register File segment. EU's may be shared among multiple BW's.
[Figure 4.12 diagram: (A) an IQ entry holds Opcode, W-REG, R-REG1, R-REG2, and Immediate fields, each with a Valid bit; (B) an HB entry additionally holds, for each source operand, a Ready bit and a 64-bit data field.]
Figure 4.12: (A) Contents of each Instruction Queue (IQ) entry. The Valid bit specifies if the given field is used by the Opcode; when it is set to 0, the scheduler ignores the field. Valid bits are set by the Decoder. R-REG refers to a read register identifier and W-REG refers to a write register identifier. The identifiers can either be local or global. (B) Contents of each Head Buffer (HB) entry. When the Ready bits of all Valid source operands are set to 1, the operation is ready to be issued.
Issue Pipeline In conventional OoO processors, the select and wakeup units are
among the most energy-consuming hardware units [16]. As mentioned in Chapter 2,
the wakeup unit utilizes long wires to transfer data from producing execution units
to pending instructions in the IQ waiting for their operands. This unit also consumes
energy via accessing large Content Addressable Memory (CAM) arrays in the OoO
processor Instruction Window to wakeup operations dependent on recently generated
results.
Chapter 2 discussed two instruction issue design models where one model spends
an extra pipeline cycle to read results after issue and the other model saves the extra
cycle by reading and storing data in the Instruction Queue immediately after register
renaming. The CG-OoO design is a middle ground between the two models: it uses
limited CAM storage space to bring data from register files to instructions once they
are about to be issued. If not all read operand values are available at the register
file lookup time, instructions wait until their operands become available through the
wakeup mechanism.
The instruction issue pipeline stages are shown in Figure 4.13. Once instructions flow through the Register Rename and Instruction Steer stages, they are pushed to the IQ of their BW. Once the HB has an empty entry, the BW controller pops the instruction at the IQ head and pushes it to the end of the HB. At the same time, the instruction reads its source operands from the register file(s). If an operand finds valid data in its register file, the data is copied to the HB source operand data entry and its Ready bit is set. If the data is not yet available in the register file, the corresponding Ready bit remains zero until the wakeup unit forwards the data. When all source-operand Ready bits are set, the instruction may be selected by the Instruction Scheduler and driven to an available Execution Unit for execution. Once an operation is selected and popped from the Head Buffer, the instruction at the IQ head takes its spot by repeating the same steps described here.
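As a rough illustration, the dispatch-and-wakeup flow above can be sketched in Python. This is a much-simplified software model, not the hardware design; the class and function names (HeadBufferEntry, dispatch_to_hb, wakeup) are invented for this sketch:

```python
from collections import deque

class HeadBufferEntry:
    def __init__(self, opcode, src_regs, dst_reg):
        self.opcode = opcode
        self.src_regs = src_regs                 # source register identifiers
        self.data = [None] * len(src_regs)       # operand data fields
        self.ready = [False] * len(src_regs)     # per-operand Ready bits
        self.dst_reg = dst_reg

    def is_ready(self):
        return all(self.ready)

def dispatch_to_hb(iq, hb, hb_size, regfile):
    """Pop instructions from the IQ head into the HB while space remains,
    reading any source operands already available in the register file."""
    while iq and len(hb) < hb_size:
        opcode, src_regs, dst_reg = iq.popleft()
        entry = HeadBufferEntry(opcode, src_regs, dst_reg)
        for i, r in enumerate(src_regs):
            if r in regfile:                     # valid data at lookup time
                entry.data[i] = regfile[r]
                entry.ready[i] = True            # Ready bit set on RF hit
        hb.append(entry)

def wakeup(hb, reg, value):
    """Forward a newly produced result to pending HB operands."""
    for entry in hb:
        for i, r in enumerate(entry.src_regs):
            if r == reg and not entry.ready[i]:
                entry.data[i] = value
                entry.ready[i] = True
```

An instruction whose second operand misses at register-file lookup time simply waits in the HB until a later `wakeup` call supplies the value.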
The instruction selection stage follows the Oldest Ready Block arbitration protocol; it visits BW's from the oldest to the youngest dynamic code block to find ready operations. Once it finds enough ready operations to occupy all execution units, it stops issuing. The baseline OoO processor uses the Oldest
Figure 4.13: Pipeline cycles of the instruction allocation and instruction issue stages in the CG-OoO processor.
Ready Instruction arbitration protocol, as seen in many previous architectures including the Alpha 21264 [27]. My choice of the Oldest Ready Block arbitration protocol allows comparing, as much as possible, two execution models that attempt to prioritize instruction issue based on the program critical path. Discovering whether the Oldest Ready Block arbitration protocol is the optimal instruction issue model for the CG-OoO remains a future research topic.
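The Oldest Ready Block arbitration described above amounts to the following selection loop (a simplified Python sketch; the `block_sn` and `head_buffer` field names are illustrative, not part of the actual design):

```python
def select_oldest_ready_block(block_windows, num_eus):
    """Oldest Ready Block arbitration: visit Block Windows from the
    oldest to the youngest dynamic code block and collect ready Head
    Buffer operations until every execution unit is occupied."""
    issued = []
    for bw in sorted(block_windows, key=lambda b: b["block_sn"]):
        for op in bw["head_buffer"]:
            if op["ready"]:
                issued.append(op)
                if len(issued) == num_eus:
                    return issued          # all EUs busy; stop issuing
    return issued
```

Operations from the oldest block are always considered first, so younger blocks only issue when older blocks cannot fill all the execution units.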
Figure 4.14A shows an Instruction Queue holding nine operations belonging to a code block. The strike-through entries correspond to operations that have already been issued and executed. The IQ Head Pointer points at the upcoming instruction to be pushed to the Head Buffer shown in Figure 4.14B. Because the HB can hold up to three operations in this figure, operations {3, 4, 5} are the next set of instructions moving to the HB; their static fields, including Opcode, Immediate, Destination Register Identifier, and Valid Bit, move to the HB RAM array, and their Source Operand Identifiers and Valid Bits move to the CAM arrays; all CAM table Ready bits are initialized to zero. In the same cycle, source operands access their register file(s) to load their CAM array Data entries and set their Ready fields. If the value of a source operand is not available in a register file, the instruction waits in the Head Buffer until its data is computed. The wakeup unit is responsible for updating the source operands not found at register file lookup time.
Wakeup Pipelining To avoid stalls in the issue logic, wakeup events can be
pipelined as shown in Figure 4.15A. Upon selecting an operation, with predictable
Figure 4.14: (A) Instruction Queue holding nine operations. The strikethrough operations refer to instructions already issued and completed. The operations in green are at the head of the IQ and about to be dispatched to the HB. R refers to read operands and W refers to write operands. The dark gray fields are global source operands and the light gray ones are local. (B) A Head Buffer with three entries. For simplicity, this figure assumes a maximum of two source operands per operation, similar to the MIPS ISA. The CAM table stores data read from register files before issue and also supports associative search for the wakeup unit.
latency, one of two scenarios happens: (A) if the destination register of the selected instruction is local, it signals a wakeup message to its own BW CAM tables; (B) if the destination register is global, it broadcasts a wakeup message to all busy BW's. Instructions with dependent operands then update their corresponding Ready bits. If a dependent instruction has all the rest of its operands ready, it becomes ready for issue. If issued, the data corresponding to its dependent operand is forwarded to it just before the execution stage.

As shown in Figure 4.15B, the wakeup process is slightly different for operations that produce results with an unpredictable latency. Such an operation wakes up its consumers after it passes through the Execute stage. Notice, however, that the consumer instruction may be moved to the Head Buffer far in advance. Figure 4.15B shows an example in which a consumer is the instruction immediately after its producer in the same BW. Once the producer leaves the Head Buffer, the consumer replaces it while accessing the register file to read its source operands. However, the producer wakes up the consumer only after its execution completes. Assuming all the rest
Figure 4.15: (A) When instructions have predictable execution latency, wakeup events are pipelined and data forwarding is used to transfer intermediate results between operations. (B) When instruction latency is unpredictable, wakeup events update the source operands before issue.
of its source operands are ready, the consumer is marked Ready and selected for execution.
Issue Stage Energy Efficiency The issue model in the CG-OoO processor is a hybrid of the OoO Pre-Issue and Post-Issue models presented in Chapter 2. It reads register file data only for operations in the HB, thereby (A) avoiding the post-issue register file read cycle and (B) avoiding the pre-issue model's large data storage overhead by storing operand data only in the HB's; the number of operations in all HB's is a fraction of all in-flight instructions. As a result, this issue model is as fast as the OoO pre-issue model and more energy efficient than both. Its sources of energy efficiency are:
1. In each BW, the wakeup unit uses a small Head Buffer storage space to hold operand data. Operations stored in the Instruction Queue are not involved in the wakeup process.
2. The wakeup unit accesses small CAM tables to search for source operands. For instance, in a CG-OoO processor with 8 BW's, each with 3 HB entries, the wakeup unit accesses 48 CAM source operand entries. The OoO processor, however, searches a 128-entry Instruction Window for ready operands.
3. Local write operands wake up source operands associated with their own BW only. This limits the wakeup scope to the two CAM tables in the same BW.
In the pre-issue model, register file lookup happens at the register rename stage, making the number of required register file read ports twice the decode width. In the CG-OoO model, however, register lookup happens right before operations enter the HB structure, implying the following worst-case number of GRF read ports (Equation 4.4):
GRF_RdPortCount = 2 × HB_Size × BW_Count    (4.4)
By design, this number is larger than twice the decode width. For example, in a
4-wide OoO superscalar machine, the decode width is 4, making the required number
of read ports 8. To produce competitive performance, as shown in Chapter 6, the CG-OoO architecture is expected to have as many as 8 BW's, each holding 3 HB entries, making the number of required read ports 48. Clearly, such a large number of ports is not feasible in practice. To mitigate this problem, two design solutions
are considered. First, the presence of LRF’s reduces the GRF lookup pressure by
about 30%. Second, the physical register file is divided into multiple small segments,
each having its own dedicated ports. In the example provided here, if each BW
holds an eighth of the GRF and each segment has as many as 4 read ports, the total
number of GRF read ports would add up to 32. Figure 4.16 illustrates the micro-
architecture design to access 8 GRF segments; the three most significant bits of the
register identifier would be used to select the register segment and the remaining bits
would be used to read an entry within the selected GRF segment. Note that segmenting register files is also beneficial from an energy standpoint, as each register read or write consumes an eighth of the energy of a unified register file access.
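Under the assumption of an 8-bit global register identifier (the width here is purely illustrative), the segment-selection decode can be sketched as:

```python
NUM_SEGMENTS = 8
SEG_BITS = 3          # log2(8): the three most significant bits pick the segment

def grf_segment_access(reg_id, reg_id_width=8):
    """Split a global register identifier into (segment, offset): the
    three MSBs select one of eight GRF segments and the remaining bits
    index an entry within that segment."""
    offset_bits = reg_id_width - SEG_BITS
    segment = reg_id >> offset_bits           # demultiplexer select lines
    offset = reg_id & ((1 << offset_bits) - 1)  # index within the segment
    return segment, offset
```

For example, identifier 0b101_00110 decodes to segment 5, entry 6 of that segment.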
While register file segmentation enables energy efficient register file management, this technique is independent of the CG-OoO execution model and may be
Figure 4.16: GRF segment access demultiplexer to access one of eight GRF segments. The three most significant bits of the register identifier are used to select a segment and the rest are used to index into the GRF segment.
separately applied to existing architectures. The pros and cons of register file segmentation are discussed by Tseng et al. [54].
A GRF segment can become a hotspot when instructions from several BW's attempt to read it in the same cycle. In such cases, due to the structural hazard, the processor postpones dispatching some instructions from their IQ to the HB until a later cycle. Chapter 6 shows that the combination of hardware and software solutions used here eliminates the hotspot condition in the benchmark applications evaluated.
Operand Data Forwarding Data forwarding hides instruction issue stall cycles by providing source operands to the next instruction immediately after the data is generated. Figure 4.15A shows the effect of data forwarding in supporting the wakeup process. CG-OoO supports data forwarding by delivering operands between any two execution units. In Figure 4.15B, the forwarding unit stores the recently produced data into the HB entry of any woken-up consumer (as well as into the corresponding register file).
Figure 4.17 illustrates the interconnection network model between EU's belonging to different execution clusters. Forwarding between directly connected EU's takes one cycle. The communication latency between EU's more than one hop away from each other is expected to be two or more cycles. However, the evaluation results in Chapter 6 make the simplifying assumption that all forwarding communications
Figure 4.17: The interconnection network connecting EU clusters together. Here, four EU's serving their corresponding four BW's form a cluster; in each cluster, EU's are connected to each other via point-to-point links. Clusters are separated by their color and are connected together through a higher network fabric. Thinner wire connections (in blue) enable data forwarding between EU's.
take a single cycle.
4.2.7.2 Issue Scheduler
To leverage dynamic instruction execution in the CG-OoO processor, three complexity-effective instruction issue techniques are evaluated:
1. Single-Issue Scheduler
2. Multi-Issue Sequential Scheduler
3. Multi-Issue Skipahead Scheduler
This section describes the pros and cons of each issue technique in detail, and Chapter 6 evaluates the performance and energy behavior of each technique separately.
(A)                        (B)
1. sll r1, r2, #3          1. sll r1, r2, #3
2. lw  r0, r2              2. lw  r0, r1
3. add r3, r4, #1          3. add r3, r4, #1
4. bne r2, r5, LOOP        4. bne r2, r3, LOOP
Figure 4.18: (A) A code snippet where four consecutive operations have no register data dependency. (B) A code snippet with two data dependencies: operation 2 depends on the result of operation 1, and operation 4 depends on the result of operation 3. The dependency registers are highlighted in bold.
Single-Issue Scheduler The single-issue scheduler model refers to the case where each BW is allowed to issue one instruction per cycle. This is the most energy efficient CG-OoO configuration I evaluate in this work. In this model, each Head Buffer has one entry. It eliminates the need for CAM lookup during the wakeup process and requires the instruction selection logic to check the readiness of only one instruction per BW. However, this model may throttle the issue of useful work from a single BW. For instance, when two back-to-back operations are not data dependent, they could potentially issue in the same cycle, but this model does not permit issuing both at once.
Figure 4.18A illustrates an example code sequence where all operations are independent of each other. Figure 4.18B illustrates a code sequence with two data dependency links between operations 1, 2 and operations 3, 4. While the potential ILP of the code in Figure 4.18A is twice that of Figure 4.18B, this issue model delivers an ILP of at most 1 for both. That is, the single-issue model issues one instruction per cycle from the code sequence in Figure 4.18A even though all four operations may be ready to issue at the same time.
Multi-Issue Sequential Scheduler The multi-issue sequential scheduler generalizes the single-issue model by allowing more than one operation to be issued from each BW. In this model, each Head Buffer holds multiple entries and manages a multi-entry CAM table. The scheduler arbitrates through all blocks and chooses the BW with the oldest dynamic code block. It then issues multiple instructions from that queue in sequential order before issuing from younger BW's. As soon as all available EU's become busy, the arbiter refrains from issuing more operations from the remaining BW's.
This scheduling model allows in-order issue of multiple operations from the same
HB. For example, assuming a three-entry HB, the case in Figure 4.18A would allow
issuing operations {1, 2, 3} in the same cycle and {4} in the next cycle. The case
in Figure 4.18B allows issuing operations {1}, {2, 3}, {4} in three separate cycles.
Operation 2 is data-dependent on operation 1, but operation 3 is independent of both 1 and 2. However, because operation 2 cannot issue along with operation 1, it also prevents younger instructions from issuing, thereby delaying the issue of operation 3.
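The sequential issue window behavior can be sketched as follows (a simplified Python model; ready flags are assumed to be precomputed by the wakeup logic):

```python
def sequential_issue(hb_ops, free_eus):
    """Multi-issue sequential scheduling within one Head Buffer: issue
    from the HB head in program order, stopping at the first non-ready
    operation or when no execution unit remains."""
    issued = []
    for op in hb_ops:
        if not op["ready"] or len(issued) == free_eus:
            break                     # a stalled op blocks younger ops
        issued.append(op)
    return issued
```

This captures the limitation discussed above: a single stalled operation at the HB head prevents ready younger operations behind it from issuing.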
Multi-Issue Skipahead Scheduler The Skipahead model takes instruction scheduling further by allowing limited out-of-order issue of operations. Even though instructions within each BW may be issued out-of-order, the selection and wakeup energy overhead is no higher than that of the multi-issue sequential scheduler; however, the skipahead scheduler's average performance is 41% higher. Using this model, and assuming a three-entry HB, the operations in Figure 4.18B would be issued as {1, 3} followed by {2, 4}. This model requires data-dependence checking when skipping across operations.
For instance, in Figure 4.18B, before issuing operation 3, its operands must be checked for true and false dependencies against the operands of skipped operation 2. If no dependency exists, operation 3 is permitted to issue out-of-order. To architect a complexity-effective dependency checker, simple XOR logic is used to cross-reference younger operations against stalling older operations. Figure 4.19 illustrates the operand dependency logic. The dark green XOR gates check for write-after-read (WAR) dependencies and the light green gates check for read-after-write (RAW) dependencies. If any of the operand checks returns a non-zero value, skipahead issue is not allowed.
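A software analogue of this dependency check, with plain equality comparisons standing in for the XOR comparators, might look like the following sketch (the dictionary encoding of operations is invented for illustration):

```python
def can_skip_ahead(younger, skipped):
    """Cross-reference a younger operation against every stalled older
    operation it would bypass; any register match blocks out-of-order
    issue. Hardware realizes these comparisons with XOR gates."""
    for older in skipped:
        if older["dst"] in younger["srcs"]:   # RAW: true dependence
            return False
        if younger["dst"] in older["srcs"]:   # WAR: false dependence
            return False
        if younger["dst"] == older["dst"]:    # WAW: false dependence
            return False
    return True
```

Applied to Figure 4.18B, operation 3 (`add r3, r4, #1`) shares no registers with the stalled operation 2 (`lw r0, r1`), so it may skip ahead.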
Figure 4.19: Head Buffer logic required to support the Skipahead issue model. The dark green XOR gates are used to handle write-after-read (WAR) hazards, and the light green XOR gates are used to handle read-after-write (RAW) hazards. Write-after-write (WAW) hazards are automatically handled by these gates. If any XOR operation produces a non-zero value, the corresponding instruction is not permitted to skip ahead.
4.2.8 Memory Stage
The memory stage in this work consists of a Load-Store Unit (LSU) that allows speculative issue of load operations across stores. Unlike most units described in this chapter, the LSU operates at instruction granularity. To maintain memory access order, memory instructions are inserted into the LSU immediately after the register rename stage. Each memory operation stores its Block Sequence Number in its load/store queue entry for use at commit or squash time (see Section 4.2.9). If the LSU runs out of space to hold memory operations, it notifies the front-end to stall fetch.
Apart from the memory mis-speculation model, the rest of the CG-OoO LSU matches that of conventional OoO processors. The same holds true for the underlying cache hierarchy, the Miss Status Holding Registers (MSHR), and the memory disambiguation predictor.
Memory operations may exist in any position within the code block. Whenever a load operation mis-speculates due to a data conflict with an older store operation, it triggers a squash event that flushes all younger code blocks, including the code block within which the squash was triggered. For instance, if operation 2 in Figure 4.18A were to trigger a memory mis-speculation event, the entire block, including operation 1, would be squashed. Notice that because operation 1 is older than operation 2, it would not have been squashed in the OoO execution model. In this work, the squash of useful operations is called a wasted squash. To avoid wasting useful work, the compiler schedules load operations as early in the code block as possible so that they are the first instructions to be issued. Doing so also improves the memory-level parallelism of the program. Chapter 6 evaluates the impact of wasted operations on performance and energy.
4.2.9 Write-Back & Commit Stage
Figure 4.20 shows the contents of a BROB entry. It holds the Block Sequence Number (SN), the Block Size, and all Global Write (GW) register operands in the block. The Block Size field is initialized from the BlkSize field of the block head. The GW operands are inserted into a BROB entry one by one as instructions (with global write registers) move from the Register Rename stage towards their BW. This section discusses the utility of each field in the BROB.
4.2.9.1 Write-Back Stage
Once an instruction completes its operation, it writes its result into either a designated register file entry (global or local) or the store queue. If the destination operand is a local register, the instruction accesses its corresponding BW to update the LRF; this is done through a BW ID tag each operation carries through the execution pipeline.
Block SN | Block Size | GW0 | GW1 | GW2 | GW3 | GW4 | GW5 | GW6 | GW7 | GW8 | GW9
Figure 4.20: Each Block Re-Order Buffer (BROB) entry consists of a Block Sequence Number (SN) field, a Block Size field, and a number of Global Write operand fields. To prevent structural hazards, the compiler controls the number of permissible global write operands per code block.
Figure 4.20 shows that each BROB entry holds a Block Size field; this field tracks the number of completed operations for each in-flight code block. Upon each instruction completion, the corresponding Block Size entry in the BROB is decremented. Once the value of a Block Size field reaches zero, the corresponding block is complete.
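The Block Size bookkeeping can be sketched as a simple per-entry counter (the class and field names here are invented for this sketch):

```python
class BROBEntry:
    """A Block Re-Order Buffer entry tracking one in-flight code block."""
    def __init__(self, block_sn, block_size):
        self.block_sn = block_sn
        self.remaining = block_size   # initialized from the head's BlkSize field
        self.global_writes = []       # GW operands, filled in at rename time

    def on_instruction_complete(self):
        """Decrement the per-block counter; when it reaches zero the
        whole block is complete and may commit once it sits at the
        BROB head."""
        self.remaining -= 1
        return self.remaining == 0
```

Commit still waits for the entry to reach the head of the BROB FIFO, so blocks retire in program order even when younger blocks complete first.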
4.2.9.2 Commit Stage
A block is committed when it is completed and is at the head of the BROB FIFO.
Upon commit, all GRF operands modified by the block must be marked Architectural
in the commit-RAT. To do so, the committing block uses its GW fields to update the
renaming state of its physical registers to Architectural. If the number of commit-RAT
ports is smaller than the number of global register operands, the commit process can
take multiple cycles.
Upon a commit event, store operations belonging to the committing code block must retire. To do so, the "commit" pointer in the Store Queue moves to the youngest store operation belonging to the committing block, found by searching for the youngest store whose Block SN matches the committing Block SN. Recall that each Store Queue entry has a Block SN field.
4.3 Squash
The squash mechanism is a hardware solution that rolls the processor back to its most recent non-speculative state by undoing incorrect modifications to the program context due to control and memory speculation. Control mis-speculation happens when the BPU outputs incorrect control predictions, leading to the fetch and execution of wrong-path code. Memory mis-speculation happens when the Load-Store Unit makes false predictions about the latest value of a memory location, leading to incorrect data computations.
The squash mechanism influences multiple tables and logic units within the processor pipeline. Previous sections elaborated on some aspects of the squash process; in this section, all key squash events are described concretely. Upon a squash event, the following take place:
• The BPU history queue and the Block PC Buffer flush their content corresponding to the wrong-path predictions. The code block PC resets to the start of the right path. In case of a control mis-speculation, the right path is the opposite side of the control operation; in case of a memory mis-speculation, the right path is the start of the mis-speculated block.
• All BW’s holding younger code blocks than the mis-speculated operation flush
their IQ content, reset all Head Bu↵er entries to zero, and mark all their LRF
registers invalid.
• The Load-Store Unit flushes all operations belonging to code blocks younger than the mis-speculated operation. This is done by comparing the mis-speculated Block SN against the Block SN of each memory operation.
• The BROB entries corresponding to code blocks younger than the mis-speculated operation are flushed. The remaining blocks complete their execution and commit.
• When the BROB is completely empty, the commit-RAT holds the up-to-date architectural state of all global registers. At this stage, the commit-RAT updates the fetch-RAT. Then the program restarts fetch and continues execution as normal.
The general squash protocol for memory and control mis-speculation is the same. The only difference is that in memory mis-speculation the block triggering the mis-speculation is flushed along with all younger blocks, while in control mis-speculation it is not. This is because control operations define the end of a code block, so a control mis-speculation only influences blocks succeeding the control operation. Load operations, on the other hand, can exist anywhere within a code block, and whenever a memory mis-speculation happens, the load operation itself must be squashed.
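The Block-SN-based flush for both cases can be sketched as follows (a simplified model over plain lists; the real hardware flushes CAM and FIFO structures in place):

```python
def squash(lsq, brob, mis_sn, memory_misspec):
    """Flush all state belonging to blocks younger than the mis-speculated
    one. On a memory mis-speculation the triggering block is flushed as
    well; on a control mis-speculation it survives, since the control
    operation ends its block."""
    boundary = mis_sn if memory_misspec else mis_sn + 1
    lsq[:] = [op for op in lsq if op["block_sn"] < boundary]
    brob[:] = [blk for blk in brob if blk["block_sn"] < boundary]
```

The single `boundary` adjustment is the only point where the two squash flavors differ, mirroring the protocol described above.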
4.4 Chapter Summary
In this chapter, I discussed the architectural details associated with each pipeline stage and provided insight into how each unit delivers energy efficient, high-performance computing support to the execution model.
The key energy saving opportunities discussed in this chapter are:
• The selective branch prediction access model, where only head operations are
permitted to do BPU lookups
• The compiler support for local register operands that enables skipping the register renaming stage
• The complexity-effective Block Window design, where only the small number of instructions present in the Head Buffer participate in the instruction wakeup and selection process
• The segmentation of the Global Register File into smaller SRAM tables to read
and write data
• The use of Local Register Files to allow data transfer on short wires (contrary
to the GRF), and to allow accessing small SRAM tables to read and write data
• The use of a roughly 10× smaller re-order buffer to maintain runtime program order
• The compiler support for producing efficient instruction sequences for each code block via block-level list-scheduling
While the CG-OoO processor provides a unique framework to exploit all the above energy efficiency opportunities, it is important to note that these features are designed to remain as decoupled from each other, and from the execution model, as possible. For instance, the energy efficient branch prediction model in this work may be replaced with a conventional branch prediction model without disturbing the CG-OoO execution model. Likewise, the absence of local operands in a binary would not disturb the execution model. This is particularly valuable for supporting backward binary compatibility to run programs with binaries produced for existing OoO processors.
Chapter 5
Methodology
The evaluation setup for this work consists of four major software modules. This
chapter describes the details of each module (listed below) and lays out the foundation
for the evaluation analysis discussed in Chapter 6.
• A binary translator and compiler back-end
• A functional emulation engine, based on the Intel Pintool [31]
• A cycle-by-cycle timing simulator
• A cycle-by-cycle energy estimation framework
Figure 5.1a shows the code flow through the compilation framework. Part (a) shows the compiler back-end producing a .s file that is later used by the simulation framework in part (b). Part (b) shows three major sections: the Functional Emulator, the Timing Simulator, and the Energy Model. The Functional Emulator instruments the original x86 binary code while running it on the native processor. The dynamic instructions generated by the emulator are reformatted into an internal Instruction Set Architecture (ISA) and sent to the Timing Simulator. The Timing Simulator models the out-of-order (OoO), coarse-grain out-of-order (CG-OoO), and in-order (InO) processors. The Energy Model produces per-access energy values, which the timing simulator uses to compute the total energy consumption of the different hardware units throughout the processor pipelines.
5.1 Compiler
The compiler performs Local Register Allocation and Global Register Allocation as well as Static Block-Level List Scheduling for each program basic block. This means the output ISA differs from the gcc-generated x86 ISA. Chapter 4 discusses this ISA in detail.
Figure 5.2 shows the compiler code processing stages. The code binary is first generated using gcc (with the -O3 optimization flag). The corresponding assembly code is then parsed to reconstruct the program's intermediate representation (IR) control flow graph (CFG). The dominance frontier is constructed through the five sub-steps shown in the figure. The dominance frontier sets are used to rename register operands into static single assignment (SSA) form. Then, the liveness analysis unit finds the live range of each SSA operand by tracking register definition and use sets. The CFG is then used to perform block-level list-scheduling followed by register allocation and dead code elimination. The code generated for the CG-OoO processor has two register allocation phases: local allocation and global allocation. The code generated for the OoO and InO cores uses global register allocation only.
Block-level list-scheduling refers to performing list-scheduling on each individual basic block separately. This avoids speculatively hoisting operations across control boundaries; no speculative static instructions means no energy overhead for executing additional (speculative) instructions at runtime. The static scheduler does not perform speculative load-store reordering; thus, store-to-load dependency detection is not a concern.1 To identify the block critical path, the compiler assigns latency numbers to each operation; the longest instruction sequence is then marked as the block critical path. When the compiler finds more available instructions than available execution units, it selects instructions in order of their path latency (from highest to lowest).
1The only instructions with unpredictable latency are memory operations. The compiler assigns L1-cache access latency (i.e. 4 cycles) to all memory operations.
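The critical-path computation the scheduler relies on can be sketched as a longest-path pass over the block's dependence edges (a simplified model; the tuple encoding of instructions is invented for this sketch):

```python
def block_critical_path(block):
    """Longest remaining-latency path for each instruction in one basic
    block. block: list of (iid, latency, dep_iids) tuples in program
    order, with deps pointing at older instructions in the same block.
    The instruction with the largest value heads the block critical
    path, so it is scheduled with the highest priority."""
    succs = {iid: [] for iid, _, _ in block}
    for iid, _, deps in block:
        for d in deps:
            succs[d].append(iid)
    path = {}
    for iid, lat, _ in reversed(block):  # successors appear later in program order
        path[iid] = lat + max((path[s] for s in succs[iid]), default=0)
    return path
```

Ties between ready instructions are then broken by these path lengths, from highest to lowest latency.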
Figure 5.1: The evaluation framework in this work consists of three main components: the compiler back-end, the simulation infrastructure, and the energy model. The .s file holds instructions in the ISA format.
[Figure 5.2 flow: x86 Binary Parser → Intermediate Representation → Dominators Construction (Strict Dominators → Immediate Dominators → Dominator Tree → Dominance Frontier) → Static Single Assignment → Liveness Analysis → List Scheduling → Local Register Allocation → Global Register Allocation → Dead Code Elimination → Output Static Instructions (CG-OoO ISA)]
Figure 5.2: The compiler back-end diagram. The first stage parses the program assembly to construct the intermediate representation, and the final stage outputs the same code as a .s file used by the runtime simulation engine.
Figure 5.3: The software blocks constructing the Functional Emulator.
5.2 Functional Emulator
Figure 5.1b shows the Pintool-based functional emulator built to instrument x86 op-
erations. Figure 5.3 shows the di↵erent software components inside the Functional
Emulator. Each of these components are discussed here. This unit generates x86
instructions and dynamically reformats them according to the code information pro-
vided by the compiler (i.e. .s code file). In case of the CG-OoO execution model, the
static schedule of instructions in each basic-block is enforced by the one provided by
the compiler. The other two processors use the original code schedule produced by
gcc.
The functional emulator and the timing simulator are setup as a producer thread
and a consumer thread respectively. Once instructions are generated and reformatted
by the emulator, they are inserted into the instruction queue (i.e. Ins Que). The
timing simulator consumes instructions as they appear in the queue.
The simulation framework supports Simpoints [18] through the PinPoints [41] API. Alternatively, it supports a fast-forwarding mode in which the simulator skips instrumenting the initial few billion instructions before emulating several tens of millions of instructions. The results provided in Chapter 6 use the fast-forwarding mode; all evaluations are configured to fast-forward 2 billion instructions, warm up the simulator with 2 million instructions, and run energy and performance evaluations on the following 20 million dynamic instructions.
In this study, I use the 2bc-gskew branch predictor unit (BPU) proposed by Seznec et al. [48]. Branch prediction is done by the emulator (instead of the timing simulator), and predictor updates are applied immediately at prediction time. To faithfully model the effect of control mis-speculation, the emulator supports executing code from the wrong path. Wrong-path execution is enabled through the Pintool context and memory-address logging API. Figure 5.4 shows the state transition diagram for switching between the wrong-path and right-path states; it also shows the sequence of events at each state transition and highlights the key Pintool API calls at each step. At every transition between the right path and the wrong path, register context and memory data must be tracked. Register context information is tracked through the Pintool API (PIN_SetContextReg(), PIN_SaveContext()). The memory data, however, is maintained as a data log in the emulator analysis code. Memory logging is enabled only during the right-path state; at the start of every right-path state, the memory log from the previous right-path state is restored to undo any potential memory overwrite during the wrong-path state, and the log is then reset to record memory writes during the current right-path state.
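The right-path memory log described above can be sketched as follows. This is a hypothetical illustration, not the emulator's code: it records the value of every right-path memory write and replays the log on the next right-path entry so that locations overwritten on the wrong path are restored. `Memory` is a stand-in for process memory that the real emulator accesses via PIN_SafeCopy().

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for process memory (addr -> 64-bit word).
using Memory = std::unordered_map<uint64_t, uint64_t>;

class WrongPathMemLog {
    std::vector<std::pair<uint64_t, uint64_t>> log_;  // (addr, right-path value)
    bool on_right_path_ = true;
public:
    // Every memory write goes through here; only right-path writes are logged.
    void write(Memory& mem, uint64_t addr, uint64_t val) {
        if (on_right_path_) log_.emplace_back(addr, val);
        mem[addr] = val;
    }
    // Entering the wrong-path state disables logging.
    void enter_wrong_path() { on_right_path_ = false; }
    // Entering the right-path state replays the previous right-path log to
    // undo wrong-path overwrites, then resets the log for the new state.
    void enter_right_path(Memory& mem) {
        for (const auto& [addr, val] : log_) mem[addr] = val;
        log_.clear();
        on_right_path_ = true;
    }
};
```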
In practice, upon every branch mis-prediction event, some number of wrong-path instructions are fetched from the instruction cache; this number varies with several parameters including the processor width, the number of instructions in the pipeline, and the type of mis-speculated control instruction. In this study, as a design simplification, the number of wrong-path instructions on a control mis-speculation is a fixed emulator configuration parameter, set to 20 for the results reported in Chapter 6. About 20 wrong-path instructions are required to resolve a mis-speculation event given a
Right-path state: memory writes are logged (PIN_SafeCopy()).
Right-path → wrong-path context switch (on a control mis-speculation): disable memory-write logging; save the right-path context (PIN_SaveContext()); switch the instruction pointer register to the wrong path (PIN_SetContextReg()); start execution at the wrong path (PIN_ExecuteAt()).
Wrong-path → right-path context switch (on the 20-instruction limit or an exception): enable the memory-write log (PIN_SafeCopy()); restore the memory-write log (PIN_SafeCopy()); reset the memory-write logger; start execution at the right-path context (PIN_ExecuteAt()).
Figure 5.4: The wrong-path / right-path state transition diagram along with the sequence of events that take place at every transition.
4-wide processor with five pipeline stages between the Decode and Execute stages.
In most cases, the emulator succeeds at executing 20 wrong-path instructions, but in certain cases it is retracted by the Wrong-Path Exception Handling unit before reaching the threshold. For example, if the wrong-path code is the catch side of an exception-handling block, or if it accesses an unallocated memory address, the emulator may leave the wrong path early. Figure 5.5 shows the average size of wrong-path code sequences for the SPEC Int 2006 benchmarks; in most cases, the wrong-path unit reaches the expected number of instructions.
5.3 Timing Simulator
The timing simulator is a consumer thread created by the emulator thread. The
emulator runs ahead and inserts instrumented instructions into the Ins Que. When
Figure 5.5: The average number of operations on the wrong path for each processor model. To produce these results, the per-benchmark wrong-path averages for all SPEC Int 2006 benchmarks are averaged.
the number of dynamic instructions reaches a maximum threshold (in this study, 1,000 instructions), execution switches to the timing simulator through the Pin semaphore API so that it can consume the instructions in the Ins Que. The timing simulator thread then consumes the instructions in the Ins Que before transferring control back to the producer thread. The Ins Que is provisioned to hold an arbitrary number of instructions; here, it holds a thousand instructions because shorter queue sizes lead to high thread context-switching overhead, which ultimately impacts the simulation time.
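The producer/consumer handoff above can be sketched as follows. This is a minimal illustration, not the simulator's actual code: a C++ condition variable stands in for the Pin semaphore API, names are hypothetical, and a plain integer stands in for an instruction record.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>

// The producer (functional emulator) fills the Ins Que with a batch of
// instructions, then blocks; the consumer (timing simulator) drains the
// whole batch and hands control back.
struct InsQue {
    static constexpr std::size_t kBatch = 1000;  // batch threshold from the text
    std::deque<uint64_t> q;                      // instruction ids, for brevity
    std::mutex m;
    std::condition_variable cv;
    bool producer_turn = true;

    void produce(uint64_t ins) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return producer_turn; });
        q.push_back(ins);
        if (q.size() >= kBatch) {        // batch full: hand control to consumer
            producer_turn = false;
            cv.notify_all();
        }
    }

    // Drain the batch (timing simulation would happen here), then return
    // control to the producer. Returns the number of instructions consumed.
    std::size_t consume_batch() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !producer_turn; });
        std::size_t n = q.size();
        q.clear();
        producer_turn = true;
        cv.notify_all();
        return n;
    }
};
```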
The timing simulator consists of a cycle-by-cycle pipeline architecture for the
OoO, CG-OoO, and InO execution models as well as their cache memory hierarchy.
Tables 5.1 and 5.2 outline the system configurations used in this study. Figure 5.6
shows the pipeline structure simulated for each of the three processor models.
Each major pipeline stage (i.e. branch prediction, fetch, decode, rename, issue,
execute, commit) is a separate class object. Instruction objects flow through the
stages via instruction queuing class objects called port. The lifetime of instructions
in a port depends on the latency of the preceding stage. For example, if the Decode
stage is configured to take three cycles to complete decoding an operation, each
operation will only become available to the Issue class three cycles after it arrives at the Decode stage. In the meantime, the instruction stays in the Decode outgoing port. The port class can be configured to receive and deliver an arbitrary number of instructions per cycle; it is, however, expected to be as wide as the pipeline stages it connects.
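The behavior of the port objects can be sketched as follows; this is a minimal illustration with assumed names, not the simulator's actual class. An instruction deposited by a stage only becomes visible to the next stage after the producing stage's latency has elapsed, and at most `width` instructions are delivered per cycle.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

struct Port {
    struct Entry { uint64_t ins; uint64_t ready_cycle; };
    std::deque<Entry> q;
    unsigned latency;   // latency of the preceding stage, e.g. 3 for Decode
    unsigned width;     // max instructions delivered per cycle

    Port(unsigned lat, unsigned w) : latency(lat), width(w) {}

    // The producing stage deposits an instruction at cycle `now`.
    void push(uint64_t ins, uint64_t now) { q.push_back({ins, now + latency}); }

    // Deliver up to `width` instructions whose latency has elapsed by `now`.
    std::vector<uint64_t> pop_ready(uint64_t now) {
        std::vector<uint64_t> out;
        while (!q.empty() && out.size() < width && q.front().ready_cycle <= now) {
            out.push_back(q.front().ins);
            q.pop_front();
        }
        return out;
    }
};
```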
The simulator timing is tracked via an internal clock. At every cycle, all pipeline
stages, in reverse order, get a chance to process one cycle worth of instructions. For
instance, a 4-wide Execute stage would complete executing up to four add operations
in one cycle.
As mentioned earlier, control speculation happens in the producer thread. Upon every control operation, the BPU produces a prediction, which is then compared against the instrumented outcome from Pin. In case of a mis-prediction, the wrong-path unit changes the execution context to execute wrong-path instructions as described earlier; otherwise, the emulator resumes on the right path. In either case, emulated instructions are reformatted to the internal API provided by the compiler and inserted into the Ins Que. Mis-speculated operations and wrong-path operations are marked with a mis-predict flag so that, when they are processed by the timing simulator, they can be identified by the squash handling unit.
Figure 5.7 shows the state transition diagram for squashing instructions in the timing simulator. When the simulator detects a mis-speculated instruction, it moves from the Normal Execution state to the Flush Younger Ops state; as the state name suggests, all instructions younger than the mis-speculated operation are removed from the re-order buffer (ROB): wrong-path instructions are deleted, and the rest are pushed back to the Ins Que to wait until the simulator returns to the Normal Execution state. After all younger instructions are flushed from the ROB, the simulator moves to the Drain Older Ops state to commit all instructions older than the mis-speculated instruction. As expected, mis-speculated control operations are committed along with older operations, but mis-speculated memory operations are pushed back to the Ins Que for re-execution with correct, up-to-date data.2
2 Note that the squash handling mechanism discussed here concerns only the timing simulator, not the functional emulator.
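The Flush Younger Ops / Drain Older Ops sequence can be sketched as follows, assuming a simple vector-backed ROB; the names and structures are illustrative, not the simulator's. Younger wrong-path ops are discarded, younger right-path ops go back to the Ins Que (oldest first), and older ops plus the mis-speculated one remain in the ROB to be drained.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Op { uint64_t id; bool wrong_path; };

void squash(std::vector<Op>& rob, std::size_t mispec_idx,
            std::deque<Op>& ins_que) {
    // Flush Younger Ops: walk from the youngest down to the op just after
    // the mis-speculated one.
    for (std::size_t i = rob.size(); i > mispec_idx + 1; --i) {
        const Op& op = rob[i - 1];
        if (!op.wrong_path)
            ins_que.push_front(op);   // right-path op waits to re-enter the pipe
        // wrong-path ops are simply discarded
    }
    // Drain Older Ops: the mis-speculated op and everything older stay
    // in the ROB to be committed.
    rob.resize(mispec_idx + 1);
}
```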
Shared Parameters                          Sizes
ISA                                        x86
Technology Node                            22nm
System clock frequency                     2GHz
L1 Cache size, associativity, latency      32KB, 8, 4 cycles
L2 Cache size, associativity, latency      256KB, 8, 12 cycles
L3 Cache size, associativity, latency      4MB, 8, 40 cycles
Main memory latency                        100 cycles
Instruction Fetch Width                    1 to 8
Branch Predictor                           Hybrid
  - G-Share Size                           8Kb
  - Bimodal (BIM)                          8Kb
  - Meta                                   8Kb
  - Global Pattern Hist                    13b
BTB Number of Entries, associativity       4096, 8
BTB Tag Size                               16b
Table 5.1: System parameters shared between all core architectures
Figure 5.6: Processor pipeline stages modeled in the timing simulator.
In-Order Processor                         Sizes
Pipeline Depth                             7 cycles
Instruction Queue                          8 entries, FIFO
Register File                              70 entries (64-bit)
Execution Unit                             1-8 wide

Out-of-Order Processor
Pipeline Depth                             13 cycles
Instruction Queue                          128 entries, RAM/CAM
Register File                              256 entries (64-bit)
Execution Unit                             1-8 wide
Re-Order Buffer                            160 entries
Load/Store Queue                           64 LQ entries, 32 SQ entries, CAM

Coarse-Grain Out-of-Order Processor
Pipeline Depth                             13 cycles
Block Window (BW) Count                    3-18
Instruction Queue / BW                     10 entries, FIFO
Head Buffer / BW                           2-5 entries, RAM/CAM
Execution Unit / BW                        1-8 wide
Local Register File / BW                   20 entries (64-bit)
Global Register File (GRF)                 256 entries (64-bit)
GRF Segments                               1-18
Number of Clusters                         1-3 clusters
Block Re-Order Buffer                      16 entries
Load/Store Queue                           64 LQ entries, 32 SQ entries, CAM
Table 5.2: System parameters for each individual core
Figure 5.8 shows a simple dynamic code sequence example. Instructions 1 through 5 are already fetched and allocated in the re-order buffer (ROB), and instructions 6 through 8 are pending fetch in the Ins Que. Instruction 2 is a branch marked as mis-speculated by the BPU in the emulator, and instructions 3 and 4 are from the wrong path. When instruction 2 reaches the Execute stage, the processor detects that it was mis-speculated and initiates the squash sequence. During the Flush Younger Ops state, instructions 3 and 4 are deleted and instruction 5 is pushed back to the Ins Que. During the Drain Older Ops state, instructions 1 and 2 are completed and committed.
To handle precise exceptions, the processor is capable of issuing instructions in-order upon an exception. Once recovered from the exception, the processor resumes its normal execution model.
5.4 Energy Model
In this study, the energy model supports table-access, cache-access, wire, stage-register, and execution-unit energies. The energy model estimates per-access dynamic energy and per-cycle static energy consumption for each unit, and it also estimates the area of each unit. The remaining parts of the simulator, such as logic blocks and control units, are assumed to have similar energy costs in both the baseline OoO and the CG-OoO model, and to have a secondary effect on the overall system energy.
The measurement techniques used to estimate the energy of each of the above-mentioned units are described next. All techniques produce per-access dynamic and static energy numbers, which are imported to the timing simulator as configuration parameters (see Figure 5.1). The timing simulator then computes the total dynamic energy consumption of each hardware unit: upon each access to the unit, the simulator increments the energy consumption by the per-access dynamic energy provided by the energy model. The total static energy consumed by each unit is estimated at the end of the simulation, when the simulator multiplies the number of simulation cycles by the per-cycle leakage energy of that unit.
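This per-unit accounting scheme can be sketched as follows; the structure and field names are hypothetical, and the energy values used below are placeholders rather than the calibrated outputs of the energy model.

```cpp
#include <cstdint>

// Dynamic energy accumulates per access; static energy is the per-cycle
// leakage times the total simulated cycles, added once at the end of a run.
struct UnitEnergy {
    double per_access_dynamic_pj;  // from the SPICE-based energy model
    double per_cycle_leakage_pj;
    double dynamic_pj = 0.0;       // accumulated during simulation

    void access() { dynamic_pj += per_access_dynamic_pj; }

    double total_pj(uint64_t sim_cycles) const {
        return dynamic_pj +
               per_cycle_leakage_pj * static_cast<double>(sim_cycles);
    }
};
```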
Figure 5.7: The state transition diagram for the squash mechanism in the timing simulator. This unit handles both memory and control mis-speculation events.
Figure 5.8: A simple dynamic code sequence example showing operations in the Ins Que and the re-order buffer (ROB). Before the branch mis-speculation is detected, the branch and its corresponding wrong-path instructions are in the ROB. After the squash is detected and handled, the ROB is empty, the younger right-path operations are pushed back to the Ins Que, and the wrong-path instructions are eliminated from the sequence without committing.
Figure 5.9: The evaluation pipeline used to produce per-access energy numbers for different SRAM tables and flip-flop arrays using SPICE simulations. Additional steps, including area estimation, energy scaling for different port configurations, and energy modeling for cache structures, are done by the Models Script. The YAML configuration file holds the necessary table specification parameters for the energy model (see Figure 5.11). CAM: content-addressable memory; RAM: random-access memory; FF: flip-flop.
5.4.0.3 Energy Model Structure
The energy model in this work consists of three major units:
• an SRAM table and flip-flop array energy model,3
• execution unit energies based on the Synopsys Design Compiler, and
• a wire energy model based on HotSpot [23].
In this section, the details associated with each of these units are discussed, with a heavier focus on the in-house energy model built for flip-flop arrays and SRAM tables.
Random Access Memory (RAM) Tables: RAM tables are designed as standard SRAM units accessed through a decoder and read through sense amplifiers. Static and dynamic energy numbers are generated using SPICE. After the SPICE simulation results are produced, additional steps including area estimation, energy
3 This energy model was built in collaboration with Subhasis Das [14] at Stanford University.
scaling for different port configurations, and energy modeling for cache structures are done.
SRAM area is dependent on the decoder design. In this study, the decoder consists of a global decoder, a pre-decoder, and a local decoder. The area is computed through Equations 5.1, 5.2, and 5.3, where w_addr is the number of address bits, w_local is the number of bits decoded on each local decoder, w_out-word is the number of word bits read on each access, L is the technology node size, P_w and P_r are the numbers of write and read ports respectively, and λ_h and λ_w are the cell dimensions (height and width) in lambda. Layout parameters follow the ITRS [25] standard for the 22nm technology node.

H_SRAM = L × λ_h × (P_w + P_r) × 2^(w_addr − w_local)    (5.1)

W_SRAM = L × λ_w × (2 × P_w + P_r) × w_out-word × 2^(w_local)    (5.2)

A_SRAM = H_SRAM × W_SRAM    (5.3)
The per-access energy numbers for all tables and pipeline registers are produced by applying the pulse signal shown in Figure 5.10 to each SPICE circuit. Phase I captures the dynamic switching energy and Phase II captures the static leakage energy. The dynamic and static energy generated in this step are measured for accessing one word (i.e. 64 bits) of an SRAM table. These numbers are then post-processed to produce final per-access estimates for different table configurations; configuration parameters, including the number of ports, the number of table entries, the row/column dimensions, and the word width, are shown in Figure 5.11.
Figure 5.12a shows the normalized energy per access for different register file sizes, and Figure 5.12b shows the corresponding area scaling for the same tables. Additionally, Figures 5.13a and 5.13b show the normalized energy per access and area of a 256-entry SRAM table with three different read/write port configurations commonly used in this study. The energy per access for a 256-entry SRAM table with 64-bit words and 8 read and 4 write ports is 7.6pJ, and its area is 1.53mm2. The SRAM
Figure 5.10: The measurement signal used in the SPICE models to measure dynamic and static energy per access.
layout parameters follow the ITRS [25] standard for the 22nm technology node.
Content Addressable Memory (CAM) Tables: CAM tables are designed as standard SRAM units accessed through a driver input module and read through sense amplifiers. Energy estimates for CAM tables are generated through SPICE in the same way as for RAM tables, and the same set of equations from the previous section is used to estimate the area of CAM arrays.
Figure 5.14 illustrates the CAM and RAM array models used in this study. Note that RAM and CAM tables operate in opposite ways: for the RAM table, the decoder drives the inputs to the table through the wordlines, whereas for the CAM table it is the searchline that is driven to trigger a table read. In other words, the wordline and searchline drive inputs while the bitline and matchline drive outputs.
To estimate the CAM access energy based on the RAM models, Equations 5.4 and 5.5 are used; bl, wl, ml, and sl refer to the RAM bit-line, RAM word-line, CAM match-line, and CAM search-line respectively, and n refers to the number of input bits to the CAM table (i.e. 64 bits). Equation 5.5 assumes that, on average, half of the
## Array Size Configurations
array:
  word_width: 64   # output width in bits
  num_words: 256   # number of words, each word is word_width wide
  cin: 12          # max input capacitance in terms of lambda

arrays:
  num_array: 1     # number of tables

## SRAM Cell Model Parameters
## lambda = min feature size = 0.5 * technology node
cell_model:
  rd_port: 2       # number of read ports (must be 1 at least)
  wr_port: 2       # number of write ports
  cin: 8           # wordline transistor input capacitance in lambda
  w: 16            # width (along wordline) in lambda
  h: 40            # height (along bitline) in lambda

## Local Wire Parameters
wire_model:
  pitch: 0.14      # width + spacing in um
  width: 0.07      # spacing in um
  height: 0.26     # from next lower layer in um
  thickness: 0.125 # vertical dimension thickness in um

## Global Wire Parameters
global_wire_model:
  eperl: 0.08      # energy/bit/mm (pJ/mm), assuming activity factor of 0.25
  tperl: 0.3       # delay per mm in ns/mm
Figure 5.11: An example YAML configuration file for the energy model, showing SRAM cell parameters, wire parameters, and table size configurations for a 256-entry register file with 2 read and 2 write ports.
Figure 5.12: Register file energy and area scaling as the number of entries grows from 64 to 256. The energy-per-access and area figures are normalized to the 256-entry register file. The energy per access for a 256-entry SRAM table with 64-bit words and 8 read and 4 write ports is 7.6pJ, and its area is 1.53mm2. The SRAM layout parameters follow the ITRS [25] standard for the 22nm technology node.
Figure 5.13: Register file energy and area scaling across the three read/write port configurations (2R2W, 4R4W, and 8R4W). The energy-per-access and area figures are normalized to the 8R4W register file. The energy per access for a 256-entry SRAM table with 64-bit words and 8 read and 4 write ports is 7.6pJ, and its area is 1.53mm2. The SRAM layout parameters follow the ITRS [25] standard for the 22nm technology node.
Figure 5.14: (a) A random-access memory (RAM) table; the corresponding cell model is shown in (b). (c) A content-addressable memory (CAM) table; the corresponding cell model is shown in (d).
.subckt nand2 a b y vdd_l
mp1 y a vdd_l vdd_l pmos L=1 W='Wp'
mp2 y b vdd_l vdd_l pmos L=1 W='Wp'
mn1 y a n1 0 nmos L=1 W='Wn'
mn2 n1 b 0 0 nmos L=1 W='Wn'
.ends

.subckt nand3 a b c y vdd_l
mp1 y a vdd_l vdd_l pmos L=1 W='Wp'
mp2 y b vdd_l vdd_l pmos L=1 W='Wp'
mp3 y c vdd_l vdd_l pmos L=1 W='Wp'
mn1 y a n1 0 nmos L=1 W='Wn3'
mn2 n1 b n2 0 nmos L=1 W='Wn3'
mn3 n2 c 0 0 nmos L=1 W='Wn3'
.ends

** Library name: ff
** Cell name: ff
.subckt flfp data clk q vddg
xnand1 o4 o2 o1 vddg nand2
xnand2 o1 clk o2 vddg nand2
xnand3 o2 o4 clk o3 vddg nand3
xnand4 o3 data o4 vddg nand2
xnand5 o2 o6 q vddg nand2
xnand6 o5 o3 o6 vddg nand2
.ends

** 64 FF's modeled here
xa d clk q vddgl flfp M=64
Figure 5.15: The 6-NAND-gate positive edge-triggered flip-flop (FF) circuit and its corresponding SPICE code. In this example, 64 FF's are modeled.
CAM input bits toggle on each access, hence the n/2 coefficient.
E_bl = E_ml    (5.4)

E_sl = (n/2) × E_wl    (5.5)
Stage Registers: A SPICE model for a 6-NAND-gate positive edge-triggered flip-flop (FF), shown in Figure 5.15, is used to evaluate the energy and area of pipeline stage registers. The figure also shows the corresponding SPICE code to simulate a 64-bit FF. The corresponding per-access dynamic and static energy numbers are measured via the pulse measurement analysis shown in Figure 5.10. To account for the activity factor of each pipeline stage entry, the dynamic energy calculation for all stage registers assumes half of the FF transistors toggle on each access.
Execution Units (EU): Different 64-bit execution units, including the add, multiply, and divide units for arithmetic and floating-point operations, are developed in Verilog and simulated in the Design Compiler, which provides per-operation energy numbers for each unit. As in the previous cases, the simulator uses this information as unit-energy values to compute the contribution of the EU's to the overall processor energy.
Wire Energy: HotSpot [23] is used for chip floorplanning and wire-length optimization. To configure HotSpot, two pieces of information are required: (a) the connectivity between different processor units, in terms of the number of bits communicated between them, and (b) the average energy consumed by each hardware unit in the processor pipeline. The connectivity information for the CG-OoO, OoO, and InO processors is provided to HotSpot according to the pipeline configurations shown in Figure 5.6, and the average energy numbers are collected through the SPICE and Design Compiler simulations previously described. HotSpot uses these parameters to find the optimal floorplan arrangement for each processor; it optimizes the floorplan based on three key parameters: area, temperature, and wire length. In this study, HotSpot is configured to prioritize area over wire length over temperature. To extract wire energy numbers from HotSpot, the software is extended to convert wire-length data to wire-energy values. The energy per access used for wires is 0.08 pJ/b-mm at the 22nm technology node. Once wire energy data is available for all major wires in the processor floorplan, the numbers are ported to the simulator. As in the previous cases, these are per-access energy numbers, so each time the simulator drives a particular wire, the energy consumed over that wire is accumulated. For instance, upon accessing a 64-bit wire, the total wire energy is incremented by half of the per-access wire energy; this is because I assume that, on average, half of the wire bits toggle on each access. The total wire energy consumption numbers are reported at the end of each simulation.
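The wire-energy accounting reduces to a one-line computation; a sketch assuming the 0.08 pJ/b-mm figure above and the half-toggle assumption (the function name is hypothetical).

```cpp
// Energy charged per access to a wire: energy-per-bit-mm times the wire
// length times the bit width, halved because on average half of the wire
// bits are assumed to toggle on each access.
double wire_access_energy_pj(double length_mm, int bits,
                             double e_per_bit_mm_pj = 0.08) {
    return 0.5 * e_per_bit_mm_pj * length_mm * bits;
}
```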
5.5 Chapter Summary
This chapter described the evaluation framework used to study the CG-OoO, OoO, and InO execution models. It provided detailed discussions of the design of the in-house compiler framework, the functional emulator built on top of the Pintool [31] API, and the in-house timing simulator. It also presented the energy model used to perform the energy studies in this work; some aspects of the energy model are built on top of existing software technologies, such as the Design Compiler and HotSpot [23], while other aspects are designed in-house.
Chapter 6
Coarse-Grain Out-of-Order Evaluation
The CG-OoO processor reaches the performance of the OoO processor at 52% of
its total energy cost. This chapter quantifies the energy benefits of the CG-OoO
processor and the pipeline stages that contribute to its superior energy profile. It
also evaluates the sources of performance gain in detail.
6.1 Sources of Energy Cost
Figure 6.1 shows the energy overhead of the OoO processor compared to the InO processor. Half of the energy overhead is in the table accesses associated with dynamic instruction scheduling; the remaining half comes from tables that are either expensive to access due to their large size or accessed too frequently. The register file and re-order buffer are examples of large, expensive-to-access tables; the register rename tables and branch prediction unit tables are examples of tables accessed more frequently than necessary.
Figure 6.2a depicts the pipeline stages of a standard out-of-order (OoO) processor, and Figure 6.2b shows its energy breakdown. The pipeline stages are color-coded to match their corresponding slices of the energy pie. This figure suggests the OoO energy consumption is fairly evenly distributed across all pipeline stages. As a result,
CHAPTER 6. COARSE-GRAIN OUT-OF-ORDER EVALUATION 86
Figure 6.1: Harmonic mean energy per cycle (EPC) of the SPEC Int 2006 benchmarks run on the in-order and out-of-order processors. The OoO energy overhead is divided into three major categories.
an energy-efficient architecture alternative must enable energy savings across nearly all pipeline stages. The CG-OoO execution model enables energy-saving opportunities throughout all pipeline stages, leading to 48% less average energy overhead compared to the OoO processor at the same performance level.
6.2 CG-OoO Design Characterization
At a high level, the CG-OoO processor di↵ers from the OoO processor in e�ciently
storing in-flight operations and e�ciently accessing register file data. To do so, it
utilizes a number of tables; namely the BW FIFO queues, BW Head Bu↵ers, BW
LRF’s, and BROB. In this section, I discuss the approach in designing each table
size.
Figure 6.3a shows the average number of in-flight dynamic code blocks; on average, four code blocks are in flight. Figure 6.3b shows that the average number of instructions per code block is ten. These figures suggest the average number of in-flight operations varies between 30 and 185 depending on the benchmark characteristics; the average over all SPEC Int 2006 benchmarks is 40 in-flight instructions. The CG-OoO
Figure 6.2: The energy breakdown of the OoO processor mapped to its pipeline stages. The pie chart shows the energy of the execution stage is about 1% of the total energy; it encapsulates the energy of the execution pipeline stage, wires, and EU's. The chart also shows the OoO energy breakdown is well distributed across all pipeline stages.
architecture evaluated in this chapter is designed to support up to 9 block windows, each holding up to 15 dynamic instructions. When a code block has more than 15 instructions, the front-end stalls until earlier instructions in the corresponding BW are issued.1
Each Head Buffer holds up to four operations. My experiments show that, upon a head-of-queue stall, an independent instruction that can hide the stall latency often exists within a distance of three operations in the same BW. Thus, as shown in Section 6.3, a Head Buffer with 4 instruction entries delivers enough instruction-level parallelism (ILP) to reach the performance of the OoO processor.2
The number of entries in each LRF is set to 20, which turns out to be sufficient to avoid register spilling in the SPEC Int 2006 benchmarks. Reducing register spilling
1 Note that dynamic splitting of a large code block is not permissible because it may lead to violating the local register communication assumptions made by the compiler.
2 9 BW's, each with a 15-entry instruction queue FIFO and a 4-entry head buffer, can hold up to 171 in-flight instructions.
Figure 6.3: (a) The average number of in-flight code blocks in the CG-OoO for SPEC Int 2006. (b) The average number of operations per dynamic code block.
reduces the need for additional MOV operations, which in turn saves energy.
The number of BROB entries is set to 16. As mentioned in Chapter 4, BW's become available as soon as their last instruction is issued for execution; thus, at runtime, the number of in-flight code blocks can be larger than the number of processor BW's. In this chapter, all evaluation results use a 16-entry BROB to avoid structural hazards due to the BROB size.
6.3 CG-OoO Performance Analysis
As discussed in Chapter 4, in the CG-OoO processor, instruction-level parallelism (ILP) is extracted from multiple sources, namely static block-level list scheduling, dynamic block-level parallelism (BLP), and limited dynamic instruction-level parallelism.
Although all processors presented in this evaluation are 4-wide superscalar machines, the CG-OoO model supports 12 EU's spread across 3 clusters; in other words, the CG-OoO front-end is 4-wide, but the total number of execution units is 12. This higher availability of computation resources allows more instruction-level parallelism to be exploited, as can be seen for the Hmmer, Bzip2, and Libquantum benchmarks in Figure 6.4.
Figure 6.5 illustrates the case where the CG-OoO is entirely 4-wide; that is, the
Figure 6.4: Performance of the CG-OoO and InO processors normalized to the per-formance of a 4-wide OoO processor. Here, the performance is measured in terms ofinstructions per cycle (IPC). All processors are configured to have a 4-wide front-end.
front-end is 4-wide and the number EU’s is also 4 (on a single cluster). In this case,
the CG-OoO performance is throttled to 7% less than the 4-wide OoO processor
baseline.
The first source of performance gain is static block-level list scheduling. Figure 6.6 shows the effect of static scheduling on performance. On average, static scheduling increases the CG-OoO performance by 14%. In the case of Hmmer, 19% more memory level parallelism is observed with the original binary schedule generated using gcc (with the -O3 optimization flag) than with the schedule generated using the block-level list scheduling compiler pass. The higher memory level parallelism is due to a superior global code schedule, which in turn leads to fewer stall cycles due to better inter-BW data communication through the GRF. In both cases, Hmmer performs better than the OoO baseline model.
The next source of performance gain is through block level parallelism3. To illus-
trate the contribution of block level parallelism, let us assume each BW can issue up
to four operations in-order; that is, if an instruction at the head of a BW queue is
not ready to issue, younger, independent operations in the same queue do not issue.
Other BW’s, however, can issue ready operations to hide the latency of the stalling
3Block level parallelism is defined in Chapter 3.
Figure 6.5: Performance of the CG-OoO and InO processors normalized to the performance of a 4-wide OoO processor. Here, the performance is measured in terms of instructions per cycle (IPC). All processors are configured to have a 4-wide front-end and back-end.
Figure 6.6: Effect of static block-level list scheduling on CG-OoO performance. On average, the CG-OoO is 14% faster with static scheduling. In all cases, the Skipahead 4 dynamic scheduling model is used.
Figure 6.7: The effect of the Skipahead model on performance. Without Skipahead, only 17% of the gap between OoO and InO is closed. With Skipahead 2, another 67% of the gap is closed, and with Skipahead 4, the entire gap is closed. All CG-OoO results use the statically list scheduled code.
BW. The No Skipahead bar in Figure 6.7 refers to this setup. It shows that, on average, 17% of the performance gap between the InO and OoO is closed through BLP. Benchmarks like H264ref and Sjeng exhibit better performance for the InO model. This is because the InO processor has a shallower pipeline depth (7 cycles) compared to the CG-OoO processor (13 cycles), allowing faster control mis-speculation recovery.
The last source of performance improvement in the CG-OoO is the limited out-of-order instruction scheduling within each BW; this feature is enabled through the Head Buffer tables. Figure 6.7 shows the performance gain obtained by varying the number of HB entries. Skipahead 2 refers to a HB with two entries; Skipahead 2 closes an additional 67% of the performance gap between InO and OoO. The 4-entry HB model (i.e., Skipahead 4) closes the rest of the performance gap. No significant performance difference is observed for larger Head Buffer sizes.
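To make the Skipahead mechanism concrete, the following sketch models the per-cycle select logic: each BW exposes only the first few entries of its queue (the Head Buffer), so a stalled head no longer blocks younger, independent operations in the same block. This is purely illustrative; the function and field names are not from the dissertation's implementation.

```python
def skipahead_issue(bw_queues, ready, hb_size=4, issue_width=4):
    """Illustrative sketch of Skipahead select, not RTL.

    bw_queues: per-BW instruction queues in program order.
    ready:     caller-supplied predicate on an operation.
    Only the first hb_size entries of each queue (the Head Buffer)
    are searched, so ready operations may issue past a stalled head
    while the hardware still scans only a tiny window per BW.
    """
    issued = []
    for queue in bw_queues:               # block-level parallelism
        for op in list(queue[:hb_size]):  # limited OoO within a BW
            if len(issued) == issue_width:
                return issued
            if ready(op):
                issued.append(op)
                queue.remove(op)
    return issued
```

With `hb_size=1` this degenerates to per-BW in-order issue (the "No Skipahead" configuration); larger Head Buffers let a block issue around a stalled head operation.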
Figure 6.8 shows the performance characteristics of CG-OoO when varying the
processor front-end width from 1 to 8. Comparing the harmonic mean results for
the OoO and CG-OoO shows the CG-OoO processor is superior on narrower designs.
A wider front-end delivers more dynamic operations to the back-end. Because the OoO model has access to all in-flight operations, it can exploit a larger effective instruction window. Despite the larger number of in-flight operations, the CG-OoO model maintains a limited view of the in-flight operations, making an 8-wide CG-OoO machine not much superior to its 4-wide counterpart.

Figure 6.8: The effect of the front-end width on the speedup of the CG-OoO and OoO. 1, 2, 4, and 8-wide processor models are evaluated here. The results indicate that the OoO performance scales better with a wider front-end.
6.4 CG-OoO Energy Analysis
The CG-OoO execution model augments the dynamic out-of-order execution model with an energy-efficient design solution. In doing so, it improves the energy efficiency of most pipeline stages. Overall, the CG-OoO shows an average 48% energy reduction for the SPEC Int 2006 benchmarks. Figure 6.9a shows the overall energy level for the CG-OoO, OoO, and InO processors; Figure 6.9b shows the harmonic mean energy breakdown for different pipeline stages; all benchmarks follow a similar energy breakdown trend as the harmonic mean. This figure shows the main energy savings are in the Branch Prediction, Register Rename, Issue, Register File access, and Commit stages. In this section, the source of energy saving within each stage is discussed.
Given that the main contribution of this study is building an energy-efficient processor core, Figure 6.9c excludes cache and memory system energy. Excluding the cache and memory system, the CG-OoO results in a 61% average energy saving versus an OoO processor with similar performance. All processor evaluations in this work use the same cache model.
Figure 6.10 shows the inverse of the energy-delay (ED) product, indicating the favorable energy-delay characteristics of the CG-OoO over the OoO for all benchmarks, even those that fall short of the OoO performance such as Sjeng and Gobmk. The average of the inverse of the ED product is 1.9.
Figure 6.11 shows the static and dynamic energy breakdown for different benchmarks relative to the OoO baseline. On average, the leakage energy is smaller than 4% of the total energy.
Next, let us focus on the energy saving characteristics of each of the CG-OoO
pipeline stages.
6.4.1 Block Level Branch Prediction
Block-level branch prediction is primarily focused on saving energy by accessing the branch prediction unit at block-level granularity rather than fetch-group-level granularity. For a benchmark application with an average block size of eight running on a 4-wide processor, this translates to roughly a 2x reduction in the number of accesses to the branch prediction tables. Figure 6.3b shows the average block sizes for the SPEC Int 2006 benchmarks. Figure 6.12 shows the relative energy-per-cycle for the CG-OoO model compared to the OoO baseline. On average, Block Level Branch Prediction is 53% more energy efficient than the OoO model. Hmmer shows an 83% reduction in branch prediction energy because of its larger average code block size (see Figure 6.3b).
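The roughly 2x access-count reduction can be checked with a toy counting model. The function below and its defaults are illustrative assumptions, not the dissertation's simulator:

```python
import math

def bp_accesses(n_instrs, fetch_width=4, block_size=8):
    """Toy model: a conventional front-end looks up the branch
    predictor once per fetch group, while block-level prediction
    looks it up once per code block."""
    per_fetch_group = math.ceil(n_instrs / fetch_width)
    per_block = math.ceil(n_instrs / block_size)
    return per_fetch_group, per_block

# 4-wide fetch, average block size of eight instructions
fg, blk = bp_accesses(10_000)  # fg = 2500 lookups, blk = 1250 lookups
```

Under these parameters the block-level scheme performs half as many predictor table accesses, matching the rough 2x figure quoted above.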
6.4.2 Register File Hierarchy
The CG-OoO register file hierarchy contributes to the processor energy savings in four different ways, each of which is discussed here.

• Local Register Files are low energy tables
• Register Rename Bypass is enabled for local operands
• Segmented Global Register Files reduce access energy
• Register renaming optimizations reduce on-chip data movement

Figure 6.9: CG-OoO, OoO, and InO normalized energy per cycle (EPC) results. (a) Normalized EPC of all processors (including the cache); on average, the CG-OoO is 48% more energy efficient than the OoO. (b) Harmonic mean EPC breakdown of all processor models; this panel highlights five major sources of energy saving in the CG-OoO processor pipeline, each of which is discussed in detail in this chapter. (c) Normalized EPC of the core excluding the cache; on average, the CG-OoO core is 61% more energy efficient than the OoO.

Figure 6.10: The inverse of the energy-delay product for the CG-OoO design normalized to the OoO design (higher is better). Overall, the CG-OoO is 1.9x more efficient than the OoO.

Figure 6.11: Static and dynamic EPC normalized to the OoO processor (lower is better). Overall, static EPC is about 4% of the total EPC.

Figure 6.12: The branch prediction unit table access EPC normalized to the OoO processor. Overall, the CG-OoO BPU is 53% more efficient than that of the OoO.
Local register files (LRF) are statically allocated. As a result, each Block Window holds an exclusive LRF. The 20-entry LRF energy-per-access is about 25x smaller than that of a unified, 256-entry register file in the baseline OoO processor. The LRF has 2 read and 2 write ports, while the unified register file has 8 read and 4 write ports. In addition, since each BW holds an LRF near its instruction window and execution units, operand reads and writes take place over shorter average wire lengths. LRF's also enable additional energy saving by avoiding local write-operand wakeup broadcasts. Figure 6.13 shows the contribution of the local register file energy compared to the OoO baseline; it shows an average 26% reduction in register file energy consumption due to local register accesses.
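As a rough illustration of why a small, lightly ported table is cheap, consider a crude first-order model in which access energy scales with the number of entries times the number of ports. The function and unit constant below are placeholders; the dissertation's ~25x figure comes from detailed energy modeling, which this toy model only approximates:

```python
def rf_access_energy(entries, read_ports, write_ports, e_cell=1.0):
    """Crude first-order sketch: access energy grows with the number
    of entries (bitline length) and total ports (wordline/bitline
    capacitance). e_cell is an arbitrary unit, not a measured value."""
    ports = read_ports + write_ports
    return e_cell * entries * ports

lrf = rf_access_energy(entries=20, read_ports=2, write_ports=2)
grf = rf_access_energy(entries=256, read_ports=8, write_ports=4)
ratio = grf / lrf  # ~38x under this toy model; detailed modeling
                   # in the dissertation reports about 25x
```

Even this coarse sketch shows why reading a 20-entry, 4-port table is far cheaper than reading a 256-entry, 12-port one.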
Because local register operands are statically allocated, they do not require register renaming. As a result, a 23% average energy consumption reduction is observed in the register rename stage (see Figure 6.14).

Figure 6.13: The register file (RF) access EPC normalized to the OoO processor. Overall, the CG-OoO RF hierarchy is 94% more efficient than that of the OoO. S-GRF 9 shows the energy of a CG-OoO GRF with 9 segments, and RF shows the energy of an InO processor register file with the same number of ports as the OoO processor.
The global register file used in both OoO and CG-OoO has 256 entries. While
the use of local registers enables the use of a smaller global register file in CG-OoO
without noticeable reduction in performance, in my experiments, I use equal global
register file sizes for fair energy and performance modeling between the CG-OoO and
OoO processors.
As discussed in Chapter 3, to reduce the access energy overhead of a unified register file and to increase the aggregate number of ports in the CG-OoO, this processor model breaks the global register file (GRF) into multiple segments. Each segment is placed next to a BW. The access energy of each register file segment is roughly the unified OoO register file access energy divided by the number of segments. Figure 6.13 also shows the contribution of the global register file energy compared to the OoO baseline; it shows an average 68% reduction in the global register file energy consumption due to register file segmentation. Notice that GRF segmentation is not commonly used in OoO architectures; some ARM architectures bank the register file for various purposes such as better thread context switching support [10]. This means the segmented GRF technique is not exclusive to the CG-OoO, and may be pursued as an energy efficiency technique in OoO processors.

Figure 6.14: The register rename (RR) table access EPC normalized to the OoO processor. Overall, the CG-OoO RR is 23% more efficient than that of the OoO.
Figure 6.15 shows the effect of register file segmentation on energy. It shows the case of a unified GRF, one GRF segment per cluster (for a 3-cluster CG-OoO architecture), and one GRF segment per BW. As the number of register segments increases, energy consumption decreases linearly.

Placing a segment next to each BW saves energy when operations read and write their global operands to the closest GRF segment. The register-rename unit reduces data communication over wires by allocating to every global write operand an available physical register from the closest GRF segment.
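A minimal sketch of such a nearest-segment allocation policy follows. The function and data-structure names are hypothetical, and distance is modeled simply as the difference in segment index:

```python
def allocate_global_reg(free_lists, writer_segment):
    """Illustrative rename-stage policy: grant a free physical
    register from the GRF segment closest to the writing BW, so
    global write operands travel over short wires.

    free_lists[i] holds the free physical register ids of segment i.
    Falls back to the nearest non-empty segment when the writer's
    own segment has no free registers."""
    order = sorted(range(len(free_lists)),
                   key=lambda seg: abs(seg - writer_segment))
    for seg in order:
        if free_lists[seg]:
            return seg, free_lists[seg].pop()
    return None  # no free global register: rename stalls
```

The fallback keeps rename from stalling when the closest segment is exhausted, at the cost of a slightly longer wire for that operand.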
6.4.3 Instruction Scheduling
The CG-OoO processor introduces the Skipahead issue model. Figure 6.16 shows the energy breakdown for the CG-OoO dynamic scheduling hardware. In the OoO and CG-OoO, in-flight instructions are maintained in queues that are partly RAM tables and partly CAM tables. For the InO model, instructions are held in a small FIFO buffer. Figure 6.16 shows the majority of the OoO scheduling energy (75%) is in reading and writing instructions from the RAM table. Another 20% of the OoO energy is in CAM table accesses. The "Rest" of the energy is consumed in stage registers and the interconnects used for instruction wakeup and select. This figure also indicates a 90% average reduction in the CG-OoO RAM table energy (relative to the OoO RAM energy), which is due to accessing smaller SRAM tables, and a 95% average reduction in the CAM table energy, which is due to using 2 to 4-entry Head Buffers (HB) instead of the 128-entry CAM tables used in the OoO instruction queue. The "Rest" average energy is increased by 40% due to the larger number of pipeline registers at the issue stage.4 Overall, the CG-OoO issue stage is 84% more efficient than that of the OoO.

Figure 6.15: The segmented register file access EPC normalized to the OoO processor. As the number of GRF segments increases, the register file access energy decreases.
6.4.4 Block Re-Order Buffer
The CG-OoO processor maintains program order at block-level granularity. This makes read-write accesses to the BROB substantially smaller than accesses to the OoO ROB. Block write operations are done after decoding each block head, and block reads are done at the commit stage. Instructions access the BROB to notify the corresponding block entry of their completion. While the access mechanism is similar to that of

4Recall that the number of EU's for the CG-OoO is 12 (4 EU's per cluster). Since the contribution of "Rest" to the overall energy is quite small, a 40% increase in its energy is insignificant as seen in Figure 6.16.
Figure 6.16: The instruction issue EPC normalized to the OoO processor. Overall, the CG-OoO issue stage is 84% more efficient than that of the OoO. This energy is broken into three main categories to better highlight the sources of energy consumption.
the OoO processor, the frequency of accesses to the BROB is substantially smaller, making it much cheaper to maintain program order in the CG-OoO.

In addition, since the BROB is designed to maintain program order at block granularity, it is provisioned to have 16 entries rather than the 160 entries commonly used in modern OoO processors. The 10x reduction in the re-order buffer size makes all read-write operations 10x less energy consuming. Figure 6.17 shows a 76% average energy saving for the CG-OoO model.
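The block-granularity ordering logic described above can be sketched as follows. This is an illustrative model, not the simulator's implementation; the entry fields and the 16-entry default are assumptions drawn from the text:

```python
from collections import deque

class BlockROB:
    """Sketch of a block-level re-order buffer (BROB): one entry per
    in-flight code block instead of one per instruction."""

    def __init__(self, size=16):
        self.size = size
        self.entries = deque()  # program order; head = oldest block

    def dispatch(self, block_id, n_instrs):
        """Allocate an entry after the block head is decoded."""
        if len(self.entries) == self.size:
            return False  # structural hazard: front-end stalls
        self.entries.append({"id": block_id, "pending": n_instrs})
        return True

    def complete(self, block_id):
        """An instruction notifies its block entry of completion."""
        for entry in self.entries:
            if entry["id"] == block_id:
                entry["pending"] -= 1
                break

    def commit(self):
        """Whole blocks commit in order once fully executed."""
        committed = []
        while self.entries and self.entries[0]["pending"] == 0:
            committed.append(self.entries.popleft()["id"])
        return committed
```

Because allocation, commit, and squash touch one entry per block rather than one per instruction, the table stays small (16 entries here) and is accessed far less often than a conventional ROB.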
6.5 Cluster Analysis
As discussed in Chapter 4, the CG-OoO architecture focuses on reducing processor energy through a complexity-effective architecture design; to remain competitive with the OoO performance, this architecture supports a larger number of execution units (EU). To do so, the CG-OoO model must employ a design strategy that is more scalable than the OoO. Clustered execution was discussed in Chapter 4 as the technique used to scale the number of execution units. A cluster consists of a number of BW's sharing a number of EU's. To illustrate the effect of different clustering configurations, the experimental results in this section assume three clusters.
Figure 6.17: The instruction commit EPC normalized to the OoO processor. Overall, the CG-OoO commit stage is 76% more efficient than that of the OoO.

Figures 6.18a and 6.18b show the normalized average performance and energy of the SPEC Int 2006 benchmarks versus the number of BW's per cluster for various numbers of EU's per cluster. The speedup figure shows some clustering configurations reach beyond the performance of the OoO. All clustering models exhibit substantially lower energy consumption compared to the OoO design. The most energy efficient configuration is the one with 1 BW and 1 EU per cluster; it is 63% more energy efficient than the OoO, but reaches only 65% of the OoO performance. The highest-performance configuration evaluated here is the one with 6 BW's and 8 EU's per cluster; it is 39% more energy efficient than the OoO, and reaches 104% of the OoO performance. The design configuration studied throughout this chapter corresponds to the cross-over performance point with 3 BW's and 4 EU's per cluster.
Figure 6.19 shows the energy-performance characteristics of the CG-OoO model, plotting all the cluster configurations presented above. The lowest energy-performance point in the plot refers to the 1 BW, 1 EU per cluster configuration and the highest energy-performance point refers to the 6 BW, 8 EU per cluster configuration. This figure suggests that as the processor resource complexity increases, the energy-performance characteristics grow proportionally.
Figure 6.18: Normalized performance and energy for different clustering configurations. All configurations assume a 3-cluster CG-OoO model; the total number of BW's and EU's is calculated by multiplying the above numbers by 3. Here, performance is measured as the harmonic mean of the IPC and the energy is measured as the harmonic mean of the EPC over all the SPEC Int 2006 benchmarks.
Beyond a certain point in the scaling of this machine, the wakeup/select and load-store unit wire latencies become so large that the energy-performance proportionality of the CG-OoO will break. The study of identifying the energy-performance break point is outside the scope of this work.
6.6 Chapter Summary
This chapter discussed the energy and performance evaluations of the CG-OoO processor in comparison to the OoO and InO processor baselines. All evaluations are done on the SPEC Int 2006 benchmark suite. The performance results consider various processor widths and instruction scheduling models. The energy results are broken down into stage-energy results to highlight the savings opportunities at each individual pipeline stage. Overall, the CG-OoO processor is 50% more energy efficient than the OoO processor at the same level of performance. Furthermore, this chapter evaluated the energy versus performance scaling of this processor. Unlike the OoO processor, the CG-OoO is an execution model that delivers energy-performance proportionality for a wide range of execution resource configurations.
Figure 6.19: The energy versus performance plot showing different CG-OoO configurations normalized to the OoO processor. The CG-OoO core configurations illustrate the energy-performance proportionality attribute of the CG-OoO processor. Here, performance is measured as the harmonic mean of the IPC and the energy is measured as the harmonic mean of the EPC over all the SPEC Int 2006 benchmarks.
Chapter 7
Related Work
This chapter discusses the previous literature related to the subject of efficient single-threaded general purpose processing and explains the key differentiating factors between them and the CG-OoO architecture. Overall, the CG-OoO differs from the existing architectures in its focus on building a bottom-up design heavily focused on energy efficiency. The CG-OoO also devises unique architectural features to reach competitive computation performance; however, obtaining exceptional performance remains a second priority in this work. The CG-OoO leverages compiler solutions as well as complexity effective architectural solutions that, to the author's knowledge, bring the highest reported energy efficiency results at the performance of the OoO processor baseline.
7.1 CG-OoO Design Features
Several high-level design features distinguish the CG-OoO processor from the previous work. Table 7.1 summarizes these key differentiating elements and compares them to related architectures in the literature. The CG-OoO leverages a distributed micro-architecture capable of issuing instructions from multiple code blocks concurrently (column 1). The key enablers of energy efficiency in the CG-OoO are (a) its end-to-end complexity effective design (column 3), and (b) its effective use of compiler assistance in doing code clustering and generating efficient static code schedules (column 6). Despite the heavy reliance of the CG-OoO architecture on the compiler in providing energy efficiency opportunities, the CG-OoO requires no profiling support to deliver the energy and performance efficiency it offers (column 4). The CG-OoO is an energy-proportional design capable of scaling its hardware resources to larger or smaller computation units according to the workload demands of programs at runtime. Energy proportionality is provided via the clustered architecture design discussed in Chapter 4 (column 5). The CG-OoO supports an out-of-order issue model (column 7) at block granularity and a limited out-of-order issue model at instruction granularity (i.e., within a block). It also supports a hierarchical register file for energy efficiency purposes (column 8). Furthermore, unlike most previous studies, this work performs a detailed processor energy modeling analysis (column 2).
Braid [53] focuses on static partitioning of program basic-block instructions into braids using the program data-flow graph. Each braid runs as an independent block of code, similar to how the CG-OoO runs its code blocks. Clustering static instructions at sub-basic-block granularity requires additional control instructions to guarantee accurate execution of the program control flow at runtime. Adding these instructions causes instruction cache pressure and additional energy overhead for processing the instructions. In the CG-OoO, the compiler clusters all basic-block instructions as a whole rather than fragmenting them into smaller clusters. Furthermore, while Braid follows the cycle-by-cycle convention for accessing the branch prediction unit, the CG-OoO performs block level branch prediction. Similarly, the Braid architecture performs instruction-level commit and squash, whereas the CG-OoO architecture supports commit and squash at block granularity to save energy. Also, each braid execution unit can issue up to two instructions per cycle in-order (InO). The CG-OoO architecture introduces the Skipahead issue model to improve the CG-OoO instruction-level parallelism without causing any noticeable energy overhead. Unlike Braid, the CG-OoO architecture also incorporates segmented global registers to further improve the processor energy efficiency.
WiDGET [55] is a power-proportional grid execution design consisting of a decoupled thread context management unit and a large set of simple execution units. It has a dynamic instruction steering protocol to effectively perform instruction scheduling at runtime. WiDGET is an extension of the work by Salverda and Zilles [44] on designing an instruction scheduling cost model. The CG-OoO is a bulk code scheduling solution that follows a similar clustering design to deliver energy proportionality while aggressively focusing on improving energy efficiency. WiDGET performs dynamic data dependency detection to steer instructions to a particular execution flow while the CG-OoO clusters instructions at compile time. WiDGET processes instructions at block level granularity and performs out-of-order instruction issue. The CG-OoO model replaces the out-of-order instruction issue with the Skipahead issue model, which is substantially more energy efficient. Lastly, unlike the CG-OoO, WiDGET does not support coarse-grain squash and commit.

Table 7.1: Eight high level design features of the CG-OoO architecture compared against the related work in the literature. The eight features (columns) are: a distributed micro-architecture with coarse-grain execution, a processor energy model, a complexity-effective design, no reliance on profiling, pipeline clustering, combined static and dynamic scheduling, out-of-order issue, and a register file hierarchy. The compared designs (rows) are the CG-OoO, Braid [53], WiDGET [55], TRIPS / EDGE [8, 46], Multiscalar [49], Complexity Effective [39], Trace Processors [43], MorphCore [28], BOLT [21], iCFP [20], ILDP [29], and WaveScalar [29].
Multiscalar [49] evaluates a multi-processing unit capable of steering coarse grain
code segments that are often larger than a basic-block to its processing units for com-
putation. Multiscalar replicates register context for each computation unit, increasing
the data communication across its register files. In contrast, the CG-OoO partitions
the global register file so that each computation unit holds a segment of the register
file to reduce read/write access energy. Multiscalar can be configured as an OoO or
InO processor. The CG-OoO supports limited OoO execution.
The Complexity Effective paper [39] proposes a distributed instruction window that simplifies the wake-up logic, issue window, and the forwarding logic design. In this paper, instruction scheduling and steering is done at instruction granularity, whereas in the CG-OoO an entire code block is assigned to a block window. The same holds for how instructions are squashed and committed.
ILDP [29] proposes an architecture for distributed processing that consists of a
hierarchical register file built for communicating short-lived registers locally and long-
lived registers globally. ILDP relies on profiling and in-order instruction scheduling
from each processing unit. The CG-OoO has a similar register file hierarchy, uses no
profiling, performs limited OoO instruction scheduling, and utilizes the coarse-grain
execution model.
TRIPS / EDGE [9] [46] is a high-performance, grid-processing architecture that
uses static instruction scheduling in space and dynamic scheduling in time. It uses
Hyperblocks [33] to map instructions to the grid of computational units. Its primary
focus is extracting as much instruction-level parallelism (ILP), thread-level parallelism (TLP), and data-level parallelism (DLP) from the program as possible. TRIPS delivers high computation performance despite the multi-cycle delay for transmitting signals over on-chip wires. Hyperblocks use branch predication to group basic-blocks that are connected together through weakly biased branches. While effective for improving instruction parallelism, Hyperblocks lead to energy inefficient mis-speculation recovery events. The CG-OoO architecture benefits from an energy efficient static and dynamic instruction scheduling hybrid. To construct Hyperblocks, the TRIPS compiler needs program profiling information. Profiling is not required in the CG-OoO.
iCFP [20] addresses the head-of-queue blocking problem in the InO processor by
building an architecture that, on every cache miss, checkpoints the program context
and steers miss-dependent instructions to a side buffer, enabling miss-independent
instructions to make forward progress. CFP [51] addresses the same problem in the
OoO processor. It enables the use of a small register file and instruction window while
maintaining the same level of performance as conventional OoO processors. Similarly,
BOLT [21], Flea Flicker [3], and Runahead Execution [38] are high-ILP, high-MLP 1,
latency-tolerant (LT) architecture designs for energy-efficient out-of-order execution.
All of these architectures follow the runahead execution model. BOLT uses a
slice buffer design that utilizes minimal hardware resources. Unlike iCFP, CFP, and
BOLT, which perform lightweight dynamic instruction scheduling to manage available
instructions and hide LLC cache misses, the CG-OoO utilizes multiple issue queues
rather than a side buffer to tolerate LLC miss latency. Furthermore, unlike
the CG-OoO, none of these publications target an energy-proportional architecture.
Trace Processors [43] proposes a register file hierarchy and instruction flow design
based on dynamic trace processing in the front-end. Similar to the CG-OoO, the
register file hierarchy in Trace Processors consists of several local register files and
a global register file. This architecture uses dynamic code traces to cluster instructions
for dispatch to different processing elements. The CG-OoO uses control-flow graph
(CFG) basic-blocks to cluster instructions, and performs a combination of static and
dynamic instruction scheduling to save energy.

1 MLP: Memory-level parallelism.
MorphCore [28] is an in-order / out-of-order hybrid architecture designed to improve
single-threaded energy efficiency. It utilizes either mode depending on the program
state and resource requirements. Unlike the CG-OoO, it uses dynamic instruction
scheduling to execute and commit instructions. MorphCore reports a 22% improvement
in its energy-delay product, while the CG-OoO achieves over 95% energy-delay
product improvement when compared against a similar OoO processor baseline.
WaveScalar [52] is an out-of-order data-flow computing architecture that utilizes
the WaveCache memory model. The WaveCache combines clusters of execution units
with small data caches and store buffers to form a computing substrate. WaveScalar
focuses on solving the problem of long wire delays by bringing computation close to
data. The CG-OoO is focused on designing an energy-efficient core architecture.
7.2 CG-OoO Energy Efficiency Features
My studies on the topic of energy-efficient, single-threaded computing highlighted the
need for constructing a distributed micro-architecture that turns the deep instruction
queue into a multitude of small parallel instruction queues. Such an architecture
allows building high-performance and energy-efficient processors. Most such designs,
however, have traditionally focused on achieving superior computation performance
rather than superior energy efficiency. To the author's knowledge, WiDGET [55] is
the only work focused on evaluating the energy trade-off in this class of design. The
CG-OoO advances this class of design by performing an end-to-end redesign of the
architecture based on the coarse-grain execution concept to achieve superior energy
efficiency.

The superior energy efficiency and competitive performance of the CG-OoO is due to
the combination of several design techniques; Table 7.2 shows the different techniques
used in this work and compares them against the design techniques in previous designs.
These techniques, combined, outperform the energy efficiency gains reported in the
previous literature while maintaining performance competitive with that of the OoO.
While many of the design features in this work have been used in other literature,
mostly to improve processor performance, the CG-OoO showcases their strong energy
efficiency capability when combined in the context of coarse-grain execution.
7.2.1 Degree of Coarse Granularity

Different works have focused on different degrees of code clustering. TRIPS [46]
utilizes Hyperblocks [33] to cluster instructions, PowerPC 604 [50] utilizes Superblocks [24],
Multiscalar [49] utilizes large CFG fragments, and Braid [53] and WiDGET [55]
utilize sub-basic-block instruction clusters. The CG-OoO focuses on coarse-grain
execution at basic-block granularity (column 8). In earlier chapters, I discussed why
basic-blocks are the ideal choice of instruction granularity for energy-efficient computing.
7.2.2 Front-end Energy Efficiency

The front-end energy efficiency features consist of (a) coarse branch prediction (column 1),
which accesses the branch prediction tables only upon control operations rather
than at every cycle, and (b) the register renaming bypass, which limits register
renaming table accesses to global register operands only (column 2).

Similar to the CG-OoO front-end, BLISS, FTB, and BSA perform branch prediction
at basic-block granularity. TRIPS / EDGE [9] [46] and Multiscalar [49] perform
branch prediction at coarser granularity.
TRIPS / EDGE [9] [46] performs coarse branch prediction at the Hyperblock
granularity. A Hyperblock may contain up to 128 operations. Instructions within a
Hyperblock are fetched, executed, and committed as a whole. Since values produced
and consumed within a block are not stored in register banks, TRIPS bypasses
register renaming and register access for a large fraction of operands. The CG-OoO
architecture finds Hyperblocks an inadequate choice for energy-efficient computation
due to the execution of predicated operations and the very large size of code blocks.
The CG-OoO manages local registers statically and maintains them in small LRFs.

Hao et al. [19] and Melvin et al. [35] propose Block-Structured ISAs (BSA) to
improve front-end efficiency and execution performance compared to OoO processors
that use conventional ISAs. These architectures maintain block descriptors along with
DESIGN (rows): CG-OoO; Braid [53]; WiDGET [55]; TRIPS / EDGE [46]; Multiscalar [49];
Complexity Effective [39]; Trace Processors [43]; BLISS [59], FTB [42];
BSA [19] [35] [56] [26] [47]; BMRF [54].

FEATURES (columns): (1) Block Branch Prediction; (2) Register Renaming Bypass;
(3) Local Register File; (4) Segmented Global Register File; (5) Block-ROB (Coarse
Commit); (6) Skipahead; (7) Coarse-Grain Squash Model; (8) CG @ Basic-block.

Table 7.2: Eight key micro-architectural features contributing to the energy efficiency
of the CG-OoO architecture compared against the related work in the literature.
CG in the last column refers to the level of coarse granularity with respect to the
granularity of a basic-block.
the basic-block code. BLISS [59] and FTB [42] are block-level, energy-aware front-end
architecture designs that maintain block descriptors separate from the basic-block
code and replace the Translation Lookaside Buffer (TLB) with a BB-Cache that stores
the block descriptors in programs. These architectures improve instruction cache
energy, improve branch prediction by reducing over- and under-speculation, increase
the ratio of retired to fetched operations, and avoid unnecessary instruction fetches.

The CG-OoO design maintains block descriptors along with the basic-block code
and utilizes the TLB to fetch instructions. The choice of an efficient front-end is
orthogonal to the focus of this work. The CG-OoO architecture delivers an execution
model platform capable of providing significant energy efficiency compared to
conventional OoO architectures. The flexible design of the CG-OoO, however, offers
convenient integration opportunities for energy-aware, block-level solutions such as
BLISS.
7.2.3 Back-end Energy Efficiency

The back-end energy efficiency features consist of (a) the register file hierarchy design
(columns 3 and 4), which combines static and dynamic register allocation, (b) the
unique Skipahead instruction issue design, which leverages a distributed issue queue
model and enables limited out-of-order issue (column 6), and (c) a block-level re-order
buffer to maintain program order and enable program squash at a coarse granularity
while maintaining support for precise exceptions (columns 5 and 7). Column 5 shows
that TRIPS maintains program order at block granularity (i.e. Hyperblock). Column 7
shows that Multiscalar and Trace Processors support coarse-grain squash; however,
due to the large size of code blocks in both architectures, control mis-speculations
expose the pipeline depth more significantly than in the CG-OoO.
7.3 OoO Energy Efficiency Arguments

Czechowski et al. [11] discuss the energy efficiency techniques used in the recent
generation of Intel CPU architectures (e.g. Core i7 4770K). They focus on effective
use of the Micro-op Cache and Loop Cache to bypass excessive fetch and decode
activity on the processor front-end. They also describe the use of the Single Instruction
Multiple Data (SIMD) ISA and other circuit-level innovations as means to improve
processor energy and performance. The CG-OoO processor questions the inherent
energy efficiency attributes of the OoO execution model and provides a solution that
is over 50% more energy efficient than the baseline OoO. The energy efficiency
techniques discussed in [11] can also be applied to the CG-OoO model to make it
even more energy efficient.
7.4 Energy Modeling

McPAT [30] and Wattch [7] are widely known energy modeling tools in the computer
architecture research community; McPAT extends and improves the modeling
capabilities of Wattch. I chose to build an independent energy model because of (a) the
simulation inaccuracies of McPAT [57], and (b) the difficulty of correctly modifying
the McPAT software to model the CG-OoO processor. Instead, I built an energy
model using industrial-grade circuit simulation tools, Design Compiler and SPICE.
My energy model does not yet have an extensive evaluation of all processor components
(e.g. logic blocks). However, as discussed earlier, the missing hardware components
have a secondary effect on the overall core energy and would be designed similarly
for both the CG-OoO and OoO processors.
7.5 Simulation Framework

Gem5 [4], Marssx86 [40], and ZSim [45] are among the popular simulation frameworks
used for modeling OoO and InO processors. Gem5 has limited support for OoO
execution on the x86 ISA. Marssx86 is an extension of PTLSim [58] designed for
x86 simulation. Neither of these simulators offers the flexibility to modify the
input ISA; recall that the CG-OoO makes changes to the x86 ISA. ZSim is a Pintool-based
simulator that targets fast multi-core simulation with limited support for detailed
single-threaded simulation. All of these simulators require extensive modification to
produce cycle-by-cycle energy evaluations; as a result, they are often coupled with
McPAT for energy evaluation. As discussed in Chapter 5, my simulator is based
on Pintool and performs detailed OoO, InO, and CG-OoO single-threaded timing
simulation. It also has built-in support for cycle-by-cycle energy modeling.
7.6 Chapter Summary

In this chapter, I compared the architectural characteristics of the CG-OoO processor
against the previous literature and identified the main differentiating parameters of
the CG-OoO. Braid [53] and WiDGET [55] are the two architectures closest to the
CG-OoO design; however, neither work studies all the essential design parameters for
building a highly energy-efficient single-threaded processor. To the author's knowledge,
the CG-OoO design is the only architecture with over 50% reduction in energy
consumption at the same performance level as the baseline OoO processor.
Chapter 8
Conclusion
8.1 Summary of Thesis Contributions
In this thesis, I presented the Coarse-Grain Out-of-Order (CG-OoO) processor
architecture. This processor is unique in its ability to save over 50% of the out-of-order
processor energy while maintaining the same level of performance. The CG-OoO
consists of several energy efficiency solutions, including compiler optimization
techniques to improve the code energy footprint and hardware techniques designed
to bring energy efficiency to the processor pipeline. The energy-saving software
techniques consist of:

• Software Managed Register Allocation

• Block-level Instruction List Scheduling

Software Managed Register Allocation: the CG-OoO register file hierarchy
includes a Local Register File model that tracks registers whose lifetime is limited
to their own basic-block. These registers are allocated and managed by the compiler.
This saves energy by allowing the processor to bypass register renaming for local
register operands, and by reducing the pressure on the physical register file (a.k.a.
the Global Register File).
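To make the local/global distinction concrete, here is a minimal Python sketch of the lifetime analysis such a compiler could use; it is an illustration, not the dissertation's actual compiler pass, and all register and block names are hypothetical. A register whose definition and all uses fall inside one basic-block can live in the LRF and skip renaming; anything live across blocks goes to the GRF.

```python
def classify_registers(blocks):
    """blocks: list of basic blocks; each block is a list of
    (dest_regs, src_regs) tuples in program order."""
    defined_in = {}   # register -> block index of its (first) definition
    used_in = {}      # register -> set of block indices that read it
    for b, insts in enumerate(blocks):
        for dests, srcs in insts:
            for r in srcs:
                used_in.setdefault(r, set()).add(b)
            for r in dests:
                defined_in.setdefault(r, b)
    local, glob = set(), set()
    for r, b in defined_in.items():
        if used_in.get(r, set()) <= {b}:
            local.add(r)   # lifetime confined to one block -> LRF, no renaming
        else:
            glob.add(r)    # crosses a block boundary -> GRF, renamed
    return local, glob

blocks = [
    [(["t0"], ["a"]), (["t1"], ["t0"]), (["x"], ["t1"])],  # block 0
    [(["y"], ["x"])],                                      # block 1 reads x
]
local, glob = classify_registers(blocks)   # t0 and t1 never escape block 0
```

In this toy example only `x` crosses a block boundary, so only `x` would consume a physical (global) register and a rename-table entry.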
Block-level Instruction Scheduling: each code block is list-scheduled by the
compiler to generate an efficient code schedule to run on the CG-OoO processor.

The energy-saving hardware techniques focus on reducing excessive accesses to
hardware units, or on reducing large tables into either smaller tables or partitioned
tables. The partitioned tables are distributed among the on-chip Block Window
clusters. The energy-saving hardware techniques consist of:
• Block-Level Branch Prediction & Fetch
• Register File Hierarchy
• Register Renaming Bypass
• Dynamic Instruction Scheduling
• Block-Level Re-Order Buffer
Block-Level Branch Prediction & Fetch: the branch prediction unit, in this
study, decouples control speculation from Fetch and enables a significant reduction in
branch predictor lookup events without additional fetch-stall cycles. As a result, the
branch predictor energy overhead is reduced and its prediction accuracy is improved.
The improved prediction accuracy is the result of fewer accesses to the BPU, which
in turn reduce mis-prediction events due to aliasing.
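As a rough illustration of why this reduces lookup energy, the following hedged Python sketch counts predictor probes; the function name, interface, and block sizes are mine, chosen only to show the per-cycle versus per-block difference.

```python
def bpu_lookups(block_sizes, per_instruction=False):
    """Predictor probes needed to fetch the given sequence of basic blocks."""
    if per_instruction:
        return sum(block_sizes)   # conventional: probe on every fetch cycle
    return len(block_sizes)       # block-level: probe once per control op

sizes = [6, 4, 9, 5]              # instructions in each fetched block
conventional = bpu_lookups(sizes, per_instruction=True)
coarse = bpu_lookups(sizes)       # one probe per block instead of per cycle
```

For these illustrative block sizes the block-level scheme performs 4 probes where a per-cycle scheme performs 24; fewer probes also mean fewer aliasing opportunities in the prediction tables.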
Register File Hierarchy: the two register file structures used in this architecture
are known as the Local Register File (LRF) and the Global Register File (GRF).
Each Block Window holds a small, compiler-managed LRF; the small size of the
LRF makes it an energy-efficient table to access. The GRF is managed by the
register rename unit and is physically partitioned to reduce its access energy and
increase its effective port count.
Register Renaming Bypass: only the global register operands require register
renaming. Bypassing the register rename stage for local operands reduces the energy
footprint of this stage.
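A minimal sketch of the bypass idea follows, with an assumed interface rather than the actual pipeline structures: only operands flagged as global touch the rename table, so local operands incur no table-access energy. The counter is a stand-in proxy for that energy.

```python
class RenameBypass:
    """Rename table touched only by global operands; locals bypass it."""
    def __init__(self):
        self.table = {}           # architectural global reg -> physical reg
        self.next_phys = 0
        self.table_accesses = 0   # proxy for rename-stage energy

    def rename_src(self, reg, is_global):
        if not is_global:
            return reg            # LRF operand: slot fixed by the compiler
        self.table_accesses += 1
        return self.table.get(reg, reg)

    def rename_dst(self, reg, is_global):
        if not is_global:
            return reg            # local destination: no physical register
        self.table_accesses += 1
        phys = "p%d" % self.next_phys
        self.next_phys += 1
        self.table[reg] = phys
        return phys

rn = RenameBypass()
rn.rename_dst("l0", is_global=False)  # local destination: table untouched
rn.rename_dst("g1", is_global=True)   # global destination: gets a physical reg
rn.rename_src("g1", is_global=True)   # global source: reads the mapping
```

Three operands are processed but only the two global ones access the table; in code with mostly block-local values, most operands take the bypass path.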
Dynamic Instruction Scheduling: the CG-OoO supports limited out-of-order
issue via two major techniques: (a) across code blocks, the presence of multiple
in-flight code blocks allows concurrent issue of instructions from different code blocks,
enabling block-level parallelism, and (b) within each code block, the Skipahead
scheduling model allows limited, complexity-effective out-of-order instruction issue.
The combination of these two techniques (along with the compiler's block-level list
scheduling) provides significant energy savings in this stage.
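The within-block half of this scheme can be sketched as follows; this is a simplified model of the Skipahead idea with an illustrative lookahead depth of 2, not the design's actual parameters. If the queue head is not ready, only a small fixed number of younger entries are examined, keeping the select logic far cheaper than a full CAM-based issue window.

```python
def skipahead_issue(queue, ready, depth=2):
    """Pick the next instruction to issue from one block window's queue.

    queue: instruction ids in program order; ready: set of ids whose
    operands are available; depth: how far past a blocked head the
    scheduler may look (illustrative value, not the real design point).
    """
    for i, inst in enumerate(queue[: depth + 1]):  # head + limited lookahead
        if inst in ready:
            return i
    return None                                    # stall this block window

q = ["i0", "i1", "i2", "i3"]
assert skipahead_issue(q, ready={"i1"}) == 1    # head blocked: skip ahead
assert skipahead_issue(q, ready={"i3"}) is None # outside the lookahead window
```

Because each block window runs this cheap check independently, block-level parallelism across windows supplies most of the out-of-order benefit while each window's logic stays near in-order complexity.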
Block-Level Re-Order Buffer: unlike the out-of-order model, the CG-OoO model
maintains program order at block granularity. This reduces the ROB size by 10×,
reducing its energy overhead by the same amount. Squash events are also handled at
block granularity. Special provisions are made to ensure precise exception handling
is supported upon interrupt or exception events.
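The following hedged Python sketch models a block-granularity ROB; the data structure and method names are my own, not the hardware's. It keeps one entry per in-flight block, commits fully completed blocks in order, and squashes younger blocks wholesale.

```python
from collections import deque

class BlockROB:
    """One ROB entry per in-flight code block: [block_id, uncompleted_insts]."""
    def __init__(self):
        self.rob = deque()

    def dispatch(self, block_id, n_insts):
        self.rob.append([block_id, n_insts])      # blocks enter in program order

    def complete(self, block_id):
        for entry in self.rob:
            if entry[0] == block_id:
                entry[1] -= 1                     # one instruction finished

    def commit(self):
        committed = []
        while self.rob and self.rob[0][1] == 0:   # oldest block fully done
            committed.append(self.rob.popleft()[0])
        return committed

    def squash_younger_than(self, block_id):
        while self.rob and self.rob[-1][0] != block_id:
            self.rob.pop()                        # drop whole younger blocks

rob = BlockROB()
rob.dispatch("B0", 2)
rob.dispatch("B1", 1)
rob.complete("B0")
rob.complete("B0")
committed = rob.commit()                          # B0 retires; B1 still in flight
```

With, say, 16 instructions per block, a per-block ROB needs an order of magnitude fewer entries than a per-instruction ROB for the same in-flight window, which is the source of the 10× size reduction cited above.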
8.2 Future Research Directions

There are many different applications and extensions that can be applied to the
CG-OoO processor architecture and compiler.

Extension to Code Optimization: part of this research has focused on building
a compiler infrastructure to improve energy efficiency. While static compilation
helped improve the CG-OoO energy efficiency and performance, there is potential
for performing profile-driven optimizations using, for instance, a dynamic binary
translation framework like the Nvidia Denver project [6, 13]. Extending the CG-OoO
processor to utilize profiling may enable the use of code blocks that combine multiple
basic-blocks into execution traces that deliver more instruction-level parallelism (ILP)
and consume less energy.
Extension to Multi-threaded Applications: the CG-OoO processor presents an
execution model that delivers competitive performance with superior energy efficiency.
This, however, is only the first step toward extending this execution model to the
multi-threaded and multi-core space, where further research can study the compiler
and architecture requirements of modern parallel applications running in the CG-OoO
environment.
Extension to Mobile and Server Applications: this research focuses on evaluating
the energy and performance impact of the CG-OoO processor on the SPEC
Int 2006 benchmark suite, which consists of mostly scientific computing applications
and is known as one of the most challenging single-threaded benchmark suites.
Extending the evaluation of the CG-OoO to server benchmark applications would
highlight additional architectural design features a CG-OoO server processor may
require. Likewise, studying mobile benchmarks would identify potential design
extensions necessary for the CG-OoO to be adopted in the mobile processor space.
Extension to Hardware Scaling: the clustered execution model of the CG-OoO
processor enables two key benefits: (a) energy-performance proportionality, and
(b) scalable execution. The scalable design allows utilizing just as many Block
Window clusters as necessary to extract ILP. In the case of critical-path-bound code,
for instance, the processor may not benefit from fetching many code blocks when
none of them can contribute to the ILP. A Dynamic Resource Management Unit
that adjusts scheduling and execution resources to the runtime application ILP may
further improve the processor energy footprint without sacrificing performance.
Bibliography
[1] White paper: NVIDIA Charts Its Own Path to ARMv8. Technical report, Tirias
Research, August 2014.
[2] Omid Azizi, Aqeel Mahesri, Benjamin C Lee, Sanjay J Patel, and Mark Horowitz.
Energy-performance tradeoffs in processor architecture and circuit design: a
marginal cost analysis. In ACM SIGARCH Computer Architecture News, vol-
ume 38, pages 26–36. ACM, 2010.
[3] Ronald D Barnes, Shane Ryoo, and Wen-mei W Hwu. Flea-flicker multipass
pipelining: An alternative to the high-power out-of-order offense. In Proceedings
of the 38th annual IEEE/ACM International Symposium on Microarchitecture,
pages 319–330. IEEE Computer Society, 2005.
[4] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali
Saidi, Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh
Sardashti, et al. The gem5 simulator. ACM SIGARCH, 39(2):1–7, 2011.
[5] Sarah Bird, Aashish Phansalkar, Lizy K John, Alex Mericas, and Rajeev In-
dukuru. Performance characterization of SPEC CPU benchmarks on Intel's Core
microarchitecture based processor. In SPEC Benchmark Workshop, 2007.
[6] Darrell Boggs, Gary Brown, Nathan Tuck, and K Venkatraman. Denver: Nvidia's
first 64-bit ARM processor. 2015.
[7] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: a framework for
architectural-level power analysis and optimizations, volume 28. ACM, 2000.
[8] Doug Burger and Stephen W Keckler. 19.5: Breaking the gop/watt barrier with
edge architectures. In GOMACTech Intelligent Technologies Conference, 2005.
[9] Doug Burger, Stephen W Keckler, Kathryn S McKinley, Mike Dahlin, Lizy K John,
Calvin Lin, Charles R Moore, James Burrill, Robert G McDonald, and William
Yoder. Scaling to the end of silicon with EDGE architectures. Computer, 37(7):44–
55, 2004.
[10] Keil Tool by ARM. ARM registers. http://www.keil.com/support/man/docs/
armasm/armasm_dom1359731128950.htm. [Online; accessed 24-August-2015].
[11] Kenneth Czechowski, Victor W Lee, Ed Grochowski, Ronny Ronen, Ronak Sing-
hal, Richard Vuduc, and Pradeep Dubey. Improving the energy e�ciency of big
cores. In Proceeding of the 41st annual international symposium on Computer
architecuture, pages 493–504. IEEE Press, 2014.
[12] Bill Dally. Project Denver: Processor to usher in new era of computing. [Online].
Available from http://blogs.NVIDIA.com/2011/01/project-denver-processor-tousher-in-new-era-of-computing/ [Accessed 2nd August 2012], 2011.

[13] Bill Dally. Project Denver: Processor to usher in new era of computing. [Online].
Available from http://blogs.NVIDIA.com/2011/01/project-denver-processor-tousher-in-new-era-of-computing/ [Accessed 2nd August 2012], 2011.
[14] Subhasis Das, Tor M Aamodt, and William J Dally. Slip: reducing wire en-
ergy in the memory hierarchy. In Proceedings of the 42nd Annual International
Symposium on Computer Architecture, pages 349–361. ACM, 2015.
[15] Michael Ditty, John Montrym, and Craig Wittenbrink. Nvidia's Tegra K1 system-on-chip.
Hot Chips: A Symposium on High Performance Chips, 2014.
[16] Daniele Folegnani and Antonio Gonzalez. Energy-effective issue logic. In ACM
SIGARCH Computer Architecture News, volume 29, pages 230–239. ACM, 2001.
[17] Linley Gwennap. MIPS R12000 to hit 300 MHz. Microprocessor Report, 11(13):1,
1997.
[18] Greg Hamerly, Erez Perelman, Jeremy Lau, and Brad Calder. SimPoint 3.0:
Faster and more flexible program phase analysis. Journal of Instruction Level
Parallelism, 7(4):1–28, 2005.
[19] Eric Hao, Po-Yung Chang, Marius Evers, and Yale N Patt. Increasing the instruc-
tion fetch rate via block-structured instruction set architectures. International
Journal of Parallel Programming, 26(4):449–478, 1998.
[20] Andrew Hilton, Santosh Nagarakatte, and Amir Roth. iCFP: Tolerating all-level
cache misses in in-order processors. In High Performance Computer Architec-
ture, 2009. HPCA 2009. IEEE 15th International Symposium on, pages 431–442.
IEEE, 2009.
[21] Andrew Hilton and Amir Roth. BOLT: energy-efficient out-of-order latency-tolerant
execution. In High Performance Computer Architecture (HPCA), 2010
IEEE 16th International Symposium on, pages 1–12. IEEE, 2010.
[22] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, et al. The microarchitecture
of the Pentium 4 processor. In Intel Technology Journal. Citeseer, 2001.
[23] Wei Huang, Shougata Ghosh, Sivakumar Velusamy, Karthik Sankaranarayanan,
Kevin Skadron, and Mircea R Stan. HotSpot: A compact thermal modeling
methodology for early-stage VLSI design. Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on, 14(5):501–513, 2006.
[24] Wen-Mei W Hwu, Scott A Mahlke, William Y Chen, Pohua P Chang, Nancy J
Warter, Roger A Bringmann, Roland G Ouellette, Richard E Hank, Tokuzo
Kiyohara, Grant E Haab, et al. The superblock: an effective technique for VLIW
and superscalar compilation. The Journal of Supercomputing, 7(1-2):229–248,
1993.
[25] ITRS. ITRS 2012 Executive Summary. [Online]. Available: http://www.itrs.net/Links/2012ITRS/Home2012.htm.
[26] Vinod Kathail, Michael Schlansker, and Bantwal Ramakrishna Rau. HPL Play-
Doh architecture specification: Version 1.0. Hewlett-Packard Laboratories, 1994.
[27] Richard E Kessler, Edward J McLellan, and David A Webb. The Alpha 21264
microprocessor architecture. In Computer Design: VLSI in Computers and Pro-
cessors, 1998. ICCD’98. Proceedings. International Conference on, pages 90–95.
IEEE, 1998.
[28] K Khubaib, M Aater Suleman, Milad Hashemi, Chris Wilkerson, and Yale N
Patt. MorphCore: An energy-efficient microarchitecture for high performance
ilp and high throughput tlp. In Microarchitecture (MICRO), 2012 45th Annual
IEEE/ACM International Symposium on, pages 305–316. IEEE, 2012.
[29] H-S Kim and James E Smith. An instruction set and microarchitecture for
instruction level distributed processing. In Computer Architecture, 2002. Pro-
ceedings. 29th Annual International Symposium on, pages 71–81. IEEE, 2002.
[30] Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen,
and Norman P Jouppi. McPAT: an integrated power, area, and timing modeling
framework for multicore and manycore architectures. In Microarchitecture, 2009.
MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pages 469–
480. IEEE, 2009.
[31] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff
Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building
customized program analysis tools with dynamic instrumentation. ACM Sigplan
Notices, 40(6):190–200, 2005.
[32] Stephen W Mahin, Stephen M Conor, Stephen J Ciavaglia, Lyman H Moul-
ton III, Stephen E Rich, and Paul D Kartschoke. Superscalar instruction pipeline
using alignment logic responsive to boundary identification logic for aligning and
appending variable length instructions to instructions stored in cache, April 29
1997. US Patent 5,625,787.
[33] Scott A Mahlke, David C Lin, William Y Chen, Richard E Hank, and Roger A
Bringmann. Effective compiler support for predicated execution using the hyperblock.
In ACM SIGMICRO Newsletter, volume 23, pages 45–54. IEEE Computer
Society Press, 1992.
[34] Daniel S McFarlin, Charles Tucker, and Craig Zilles. Discerning the dominant
out-of-order performance advantage: is it speculation or dynamism? In ACM
SIGPLAN Notices, volume 48, pages 241–252. ACM, 2013.
[35] Stephen Melvin and Yale Patt. Enhancing instruction scheduling with a block-structured
ISA. International Journal of Parallel Programming, 23(3):221–243,
1995.
[36] Milad Mohammadi, Shuo Han, Tor Aamodt, and Barry Daly. On-demand dy-
namic branch prediction. 2013.
[37] Milad Mohammadi, Song Han, T Aamodt, and WJ Dally. On-demand dynamic
branch prediction.
[38] Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N Patt. Runahead execu-
tion: An alternative to very large instruction windows for out-of-order processors.
In High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.
The Ninth International Symposium on, pages 129–140. IEEE, 2003.
[39] Subbarao Palacharla, Norman P Jouppi, and James E Smith. Complexity-effective
superscalar processors, volume 25. ACM, 1997.
[40] Avadh Patel, Furat Afram, Shunfei Chen, and Kanad Ghose. MARSS: a full system
simulator for multicore x86 CPUs. In Proceedings of the 48th Design Automation
Conference, pages 1050–1055. ACM, 2011.
[41] Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew Sun, and
Anand Karunanidhi. Pinpointing representative portions of large Intel Itanium
programs with dynamic instrumentation. In Proceedings of the 37th
annual IEEE/ACM International Symposium on Microarchitecture, pages 81–92.
IEEE Computer Society, 2004.
[42] Glenn Reinman, Todd Austin, and Brad Calder. A scalable front-end architec-
ture for fast instruction delivery. In ACM SIGARCH Computer Architecture
News, volume 27, pages 234–245. IEEE Computer Society, 1999.
[43] Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, and Jim Smith. Trace
processors. In Microarchitecture, 1997. Proceedings., Thirtieth Annual IEEE/ACM
International Symposium on, pages 138–148. IEEE, 1997.
[44] Pierre Salverda and Craig Zilles. Fundamental performance constraints in hor-
izontal fusion of in-order cores. In High Performance Computer Architecture,
2008. HPCA 2008. IEEE 14th International Symposium on, pages 252–263.
IEEE, 2008.
[45] Daniel Sanchez and Christos Kozyrakis. ZSim: fast and accurate microarchitectural
simulation of thousand-core systems. In ACM SIGARCH Computer
Architecture News, volume 41, pages 475–486. ACM, 2013.
[46] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu
Kim, Jaehyuk Huh, Doug Burger, Stephen W Keckler, and Charles R Moore.
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Com-
puter Architecture, 2003. Proceedings. 30th Annual International Symposium on,
pages 422–433. IEEE, 2003.
[47] Michael S Schlansker and B Ramakrishna Rau. EPIC: An architecture for
instruction-level parallel processors. Hewlett-Packard Laboratories, 2000.
[48] Andre Seznec, Stephen Felix, Venkata Krishnan, and Yiannakis Sazeides. Design
Tradeoffs for the Alpha EV8 Conditional Branch Predictor. In Proc. IEEE/ACM
Symp. on Computer Architecture (ISCA), pages 295–306, 2002.
[49] Gurindar S Sohi, Scott E Breach, and TN Vijaykumar. Multiscalar processors.
In ACM SIGARCH Computer Architecture News, volume 23, pages 414–425.
ACM, 1995.
[50] S Peter Song, Marvin Denman, and Joe Chang. The PowerPC 604 RISC
microprocessor. IEEE Micro, (5):8–17, 1994.
[51] Srikanth T Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, and Mike
Upton. Continual flow pipelines. ACM SIGPLAN Notices, 39(11):107–119, 2004.
[52] Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin.
WaveScalar. In Proceedings of the 36th annual IEEE/ACM International Sym-
posium on Microarchitecture, page 291. IEEE Computer Society, 2003.
[53] Francis Tseng and Yale N Patt. Achieving out-of-order performance with almost
in-order complexity. In Computer Architecture, 2008. ISCA’08. 35th Interna-
tional Symposium on, pages 3–12. IEEE, 2008.
[54] Jessica H Tseng and Krste Asanovic. Banked multiported register files for high-
frequency superscalar microprocessors. ACM SIGARCH Computer Architecture
News, 31(2):62–71, 2003.
[55] Yasuko Watanabe, John D Davis, and David A Wood. WiDGET: Wisconsin
decoupled grid execution tiles. In ACM SIGARCH Computer Architecture News,
volume 38, pages 2–13. ACM, 2010.
[56] Robert G Wedig and Marc A Rose. The reduction of branch instruction execution
overhead using structured control flow. In ACM SIGARCH Computer
Architecture News, volume 12, pages 119–125. ACM, 1984.
[57] Sam Likun Xi, Hans Jacobson, Pradip Bose, Gu-Yeon Wei, and David Brooks.
Quantifying sources of error in mcpat and potential impacts on architectural
studies. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st
International Symposium on, pages 577–589. IEEE, 2015.
[58] Matt T Yourst. PTLSim: A cycle accurate full system x86-64 microarchitectural
simulator. In Performance Analysis of Systems & Software, 2007. ISPASS 2007.
IEEE International Symposium on, pages 23–34. IEEE, 2007.
[59] Ahmad Zmily and Christos Kozyrakis. Simultaneously improving code size,
performance, and energy in embedded processors. In Proceedings of the conference
on Design, automation and test in Europe: Proceedings, pages 224–229. Euro-
pean Design and Automation Association, 2006.