ENERGY-EFFICIENT
COARSE-GRAIN OUT-OF-ORDER EXECUTION
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Milad Mohammadi
August 2015
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/bp863pb8596
© 2015 by Milad Mohammadi. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Bill Dally, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Alex Aiken
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Christos Kozyrakis
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Tor Aamodt
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Preface
Throughout the past decade, energy-efficient computing has been a major problem
across the computer system space, from device technologies to chip design to computer
architecture and software systems. Today, more smartphone devices are in the
hands of consumers than ever before, and more data is generated on the internet than
ever before. This trend implies that, moving forward, building energy-efficient systems
will remain an increasingly significant challenge in both the mobile industry and the
server industry. Building energy-efficient processors is at the forefront of solving these
technological challenges.
This doctoral dissertation describes the Coarse-Grain Out-of-Order (CG-OoO)
processor architecture. Block-level code processing is at the heart of the CG-OoO
architecture; CG-OoO speculates, fetches, schedules, and commits code at block-level
granularity. CG-OoO eliminates unnecessary accesses to energy-consuming tables,
and replaces large tables with smaller, distributed tables that are cheaper to access.
CG-OoO leverages compiler-level code optimizations to deliver more efficient static
code. It exploits both instruction-level and block-level parallelism. CG-OoO
introduces the Skipahead model, a complexity-effective, limited out-of-order
instruction issue model. It is an energy-performance proportional design that
scales according to the program load. Through the energy efficiency techniques applied
to the compiler and processor pipeline stages, CG-OoO delivers over 50% energy
reduction at the performance of the baseline out-of-order processor.
Acknowledgements
The work presented in this doctoral dissertation could not have been possible without
the support, care, and guidance of many individuals. I would like to extend my sincere
appreciation to the following people.
I would like to express my gratitude to my wonderful advisor and teacher, Professor
William J. Dally, whose vision and intuition have guided me throughout my
graduate studies. I greatly benefited from Professor Dally’s endless optimism toward
solving challenging problems and his boundless enthusiasm for innovation.
I would like to sincerely thank my second advisor, Professor Tor M. Aamodt, whose
wealth of knowledge and depth of intuition provided me the tools I needed to advance
my research. Special thanks to Professor Christos Kozyrakis for his encouragement
as I entered the field of computer architecture, and for agreeing to serve on my
dissertation reading committee. I had the pleasure of having him as
my lecturer for three computer architecture courses that provided me the scientific
foundation to pursue my PhD research in this field. I also thank Professor Alex
Aiken for agreeing to serve on my dissertation reading committee. I had the distinct
pleasure of being his student in the Stanford Parallel Computing class. Special thanks
to Professor Mark Horowitz for providing valuable feedback on my thesis during the
final year of my PhD studies.
I would like to acknowledge and thank my fantastic friends and colleagues in
the CVA lab: Curt Harting, Ted Jiang, Daniel Becker, George Michelogiannakis,
James Chen, Subhasis Das, Nic McDonald, Song Han, Albert Ng, Vishal Parikh,
Camilo Moreno, Yatish Turakhia. I would also like to thank the wonderful CVA
administrators, Sue George and Uma Mulukutla, who have always been kind and
helpful to me. Special thanks to Curt Harting and James Chen for their mentorship
during the first half of my tenure at the CVA lab. Also, special thanks to my friend,
Subhasis Das, for being a passionate and smart labmate (with a great sense of humor),
especially during our collaboration on building the energy model for this thesis.
I would like to thank my friends Behnam Montazeri, Christina Delimitrou, Camilo
Moreno, Ardavan Pedram, and Nicole Celeste Rodia, who engaged with my research
and provided valuable feedback. I would also like to thank my wonderful friends
in the Stanford Persian community, whose presence made Stanford feel like home. I
especially thank my friends with whom I ran the Stanford Persian Student Association
(PSA) board: Ehsan Sadeghipour, Reza Mirghaderi, Dorna Kashef, Alireza Sharafat,
Parnian Zargham, Maryam Daneshi, Pooya Ehsani, Shahab Mirjalili, Masoud Tavazoei,
and Alborz Bejnood.
I would like to sincerely thank my extended family, Soraya, Hosein, Pedram, and
Payam Lajevardi. Soraya and Hossein have been nothing short of my second parents
during the years I lived away from home. I thank Pedram and Payam for the
brotherly advice and encouragement they gave me throughout my post-secondary
education.
I thank my sisters, Mojdeh and Yasamin Mohammadi, and my brother-in-law,
Farhad Fereidooni, for their love and support throughout the years. I also thank
them for giving my parents the care and attention they deserved during my 11-year
absence from home.
I thank my wonderful parents-in-law, Morteza and Mehri Mohammadgiahi, whose
love and emotional support have always brought hope and strength to my family.
I would like to sincerely thank my exceptional parents, Amirhossein and Soheila
Mohammadi, who enabled and supported me in pursuing my dream of becoming a
scientist. Their unconditional love for me and their enormous sacrifices gave me
the courage to pursue my dreams. I find myself in eternal debt to them.
I especially thank my extraordinary wife, my best friend, Marjan, whose continual
selfless support, at my busiest and toughest moments, helped me focus on research,
and whose unshakable confidence in me, even at times when I doubted my ability to
deliver, gave me courage to march forward.
Dedicated to my wife and my best friend, Marjan.
Contents
Preface iv
Acknowledgements v
1 Introduction 1
1.1 Collaborations and Other Contributions . . . . . . . . . . . . . . . . 2
1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 OoO Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Coarse-Grain Out-of-Order Execution Model 12
3.1 Coarse-Grain Out-of-Order Execution . . . . . . . . . . . . . . . . . . 12
3.2 Constructing Blocks for CG-OoO . . . . . . . . . . . . . . . . . . . . 15
3.2.1 Block Boundary Annotation . . . . . . . . . . . . . . . . . . . 15
3.2.2 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Program Execution Flow in CG-OoO . . . . . . . . . . . . . . 18
3.3.2 Control Speculation in CG-OoO . . . . . . . . . . . . . . . . . 21
3.4 Squash Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.1 Squash due to Control Mis-speculation . . . . . . . . . . . . . 22
3.4.2 Squash due to Memory Mis-speculation . . . . . . . . . . . . . 23
3.5 Static Instruction Scheduling . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 Sources of Parallelism in CG-OoO . . . . . . . . . . . . . . . . . . 24
3.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 System Architecture 27
4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.1 Branch Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2.2 Fetch Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.3 Decode Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.4 Register Rename / Block Allocation Stage . . . . . . . . . . . 37
4.2.5 Instruction Steer Stage . . . . . . . . . . . . . . . . . . . . . . 42
4.2.6 Front-end Examples . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.7 Issue Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.8 Memory Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.9 Write-Back & Commit Stage . . . . . . . . . . . . . . . . . . . 57
4.3 Squash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Methodology 62
5.1 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Functional Emulator . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Timing Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.4 Energy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6 Coarse-Grain Out-of-Order Evaluation 85
6.1 Sources of Energy Cost . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 CG-OoO Design Characterization . . . . . . . . . . . . . . . . . . . . 86
6.3 CG-OoO Performance Analysis . . . . . . . . . . . . . . . . . . . . . 88
6.4 CG-OoO Energy Analysis . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.1 Block Level Branch Prediction . . . . . . . . . . . . . . . . . . 93
6.4.2 Register File Hierarchy . . . . . . . . . . . . . . . . . . . . . . 93
6.4.3 Instruction Scheduling . . . . . . . . . . . . . . . . . . . . . . 98
6.4.4 Block Re-Order Buffer . . . . . . . . . . . . . . . . . . . . 99
6.5 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7 Related Work 104
7.1 CG-OoO Design Features . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2 CG-OoO Energy Efficiency Features . . . . . . . . . . . . . . . . . 109
7.2.1 Degree of Coarse Granularity . . . . . . . . . . . . . . . . 110
7.2.2 Front-end Energy Efficiency . . . . . . . . . . . . . . . . . 110
7.2.3 Back-end Energy Efficiency . . . . . . . . . . . . . . . . . 112
7.3 OoO Energy Efficiency Arguments . . . . . . . . . . . . . . . . . 113
7.4 Energy Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.5 Simulation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8 Conclusion 115
8.1 Summary of Thesis Contributions . . . . . . . . . . . . . . . . . . . . 115
8.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 117
Bibliography 119
List of Tables
5.1 System Parameters Shared Between All Core Architectures . . . . . . 71
5.2 System Parameters for Each Individual Core . . . . . . . . . . . . . . 72
7.1 Related Work: High Level Design Features Comparison . . . . . . . . 106
7.2 Related Work: Micro-architectural Features Comparison . . . . . . . 111
List of Figures
2.1 OoO, InO Execution Model Example . . . . . . . . . . . . . . . . . . 7
2.2 Energy and Performance Overhead of OoO vs. InO . . . . . . . . . . 8
2.3 OoO Architecture Pipeline Model . . . . . . . . . . . . . . . . . . . . 9
3.1 Block-Level Dynamic Execution Model . . . . . . . . . . . . . . . . . 14
3.2 The head instruction format. . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 do-while Loop Example . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 CG-OoO Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 CG-OoO Instruction Flow Example . . . . . . . . . . . . . . . . . . . 20
3.6 Instruction Flow Through the CG-OoO Front-end Example . . . . . . 22
3.7 CG-OoO Instruction Flow Squash Example . . . . . . . . . . . . . . 23
4.1 CG-OoO Detailed Micro-architecture . . . . . . . . . . . . . . . . . . 28
4.2 Branch Prediction Unit (BPU) micro-architecture . . . . . . . . . . . 30
4.3 Instruction Cache Fetch Example . . . . . . . . . . . . . . . . . . . . 32
4.4 An Example Code Fetch Sequence for CG-OoO . . . . . . . . . . . . 34
4.5 Logic Unit to Detect head Operations . . . . . . . . . . . . . . . . . . 36
4.6 Register Rename Bypass Logic . . . . . . . . . . . . . . . . . . . . . . 39
4.7 Block Allocation State Transition Diagram . . . . . . . . . . . . . . . 40
4.8 Block Allocation Routing Diagram . . . . . . . . . . . . . . . . . . . 41
4.9 CG-OoO Front-end . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.10 CG-OoO Fetch and Decode Examples . . . . . . . . . . . . . . . . . . 44
4.11 Block Window Components . . . . . . . . . . . . . . . . . . . . . . . 46
4.12 Instruction Queue & Head Buffer Entry Formats . . . . . . . . . . 46
4.13 Instruction Allocation & Issue Pipeline Stages . . . . . . . . . . . 48
4.14 Instruction Queue and Head Buffer Micro-architecture . . . . . . 49
4.15 Data-Forwarding and Wakeup Models . . . . . . . . . . . . . . . . 50
4.16 GRF segment access demultiplexer . . . . . . . . . . . . . . . . . 52
4.17 Interconnection Network Connecting EU Clusters . . . . . . . . . 53
4.18 Data Dependency Code Example . . . . . . . . . . . . . . . . . . 54
4.19 Head Buffer Micro-architecture . . . . . . . . . . . . . . . . . . . 56
4.20 Block Re-Order Buffer (BROB) Entry Format . . . . . . . . . . . 58
5.1 Simulation Software Infrastructure . . . . . . . . . . . . . . . . . . . 64
5.2 Compiler Software Pipeline . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Functional Emulator Software Infrastructure . . . . . . . . . . . . . . 66
5.4 Wrong-Path Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 Average Number of Instructions on the Wrong-Path . . . . . . . . . . 69
5.6 InO, OoO, CG-OoO Processor Pipelines . . . . . . . . . . . . . . . . 71
5.7 Squash State Transition Diagram . . . . . . . . . . . . . . . . . . . . 74
5.8 Squash Model Example . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.9 Energy Model Software Infrastructure . . . . . . . . . . . . . . . . . . 76
5.10 SPICE Energy Measurement Signal . . . . . . . . . . . . . . . . . . . 78
5.11 Energy Model Configuration Example . . . . . . . . . . . . . . . . . . 79
5.12 SRAM Table Energy & Area - Size Sweep . . . . . . . . . . . . . . . 80
5.13 SRAM Table Energy & Area - Port Sweep . . . . . . . . . . . . . . . 80
5.14 RAM & CAM Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.15 Flip-Flop in SPICE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.1 OoO Energy Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2 OoO Energy Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3 Dynamic Code Block Sizes . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4 CG-OoO Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.5 4-Wide CG-OoO Performance . . . . . . . . . . . . . . . . . . . . . . 90
6.6 List Scheduling Effect on Performance . . . . . . . . . . . . . . . 90
6.7 Skipahead Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.8 Processor Widths Effect on Performance . . . . . . . . . . . . . . 92
6.9 Normalized Processors Energy . . . . . . . . . . . . . . . . . . . . . . 94
6.10 Energy-Delay Product Inverse . . . . . . . . . . . . . . . . . . . . . . 95
6.11 Static & Dynamic Energy Breakdown . . . . . . . . . . . . . . . . . . 95
6.12 CG-OoO BPU Energy . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.13 Normalized Register File Energy . . . . . . . . . . . . . . . . . . . . . 97
6.14 Register Renaming Energy . . . . . . . . . . . . . . . . . . . . . . . . 98
6.15 Segmented Register File Energy Trend . . . . . . . . . . . . . . . . . 99
6.16 Dynamic Scheduler Energy . . . . . . . . . . . . . . . . . . . . . . . . 100
6.17 Commit Stage Energy . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.18 Harmonic Mean Speedup and Normalized Energy . . . . . . . . . . . 102
6.19 Normalized Power vs. Performance . . . . . . . . . . . . . . . . . . . 103
Chapter 1
Introduction
With recent technology innovations in different fields of computer science and
computer technology, including genomics, social media, and online entertainment, the
requirements for building energy-efficient, high-performance computer processors
have been increasing. In the consumer mobile space as well, the demand for
more energy-efficient processors that help extend battery life has been a major focus
of the computer architecture community and the processor manufacturing industry.
This research project has taken a bottom-up approach to identifying the energy
inefficiencies of existing single-threaded architectures, namely the out-of-order (OoO)
processor. The result of this study is a processor design that addresses these
inefficiencies while maintaining nearly the same level of performance as existing
processors in the industry. My research led me to find that the energy inefficiency of the
OoO processor is rooted in its overall execution model design; in other words, no
one component of the core hardware dominates the energy consumption.
This thesis addresses the energy problem by devising an alternative execution
model called the Coarse-Grain Out-of-Order (CG-OoO) model. The name, Coarse-
Grain, refers to the processor’s ability to process groups of instructions as a whole,
instead of processing instructions individually as is done in the out-of-order model.
As will be discussed in later chapters, the key to building such an energy-efficient,
high-performance, single-threaded processor is finding a design point where the
processor is nearly as simple as an in-order processor, yet still delivers high
instruction-level parallelism. Handling instructions in groups proves to be the
essential architectural requirement for the CG-OoO processor to deliver its superior
energy efficiency.
Multiple publications have shown that as the performance capability of processors
increases, their energy per operation increases non-linearly [2, 55]. In other words, the
energy cost of building powerful processors exceeds the obtained performance
benefit. In this work, I focus on designing the CG-OoO execution model such that
it consumes much less energy for the same performance capability as
the OoO processor while benefiting from a linearly proportional energy-performance
scaling trade-off.
1.1 Collaborations and Other Contributions
During my research studies at the Stanford Concurrent VLSI Architectures lab (CVA),
I worked on two research projects: the CG-OoO project, the topic of this
thesis, done in collaboration with Tor M. Aamodt and William J. Dally; and the
On-Demand Dynamic Branch Prediction (ODBP) project [36], done in
collaboration with Song Han, Tor M. Aamodt, and William J. Dally. The ODBP
work focused on building an energy-efficient branch prediction mechanism for out-of-order
processors that eliminates unnecessary accesses to the branch prediction
unit, reducing its energy consumption and improving its prediction accuracy.
1.2 Thesis Contributions
The main contribution of this work is the design of the energy-efficient coarse-grain
out-of-order architecture, which reaches the performance of the out-of-order execution
model with over 50% reduction in energy consumption.
Given the level of complexity and novelty of this architecture research, new compiler,
simulation, and energy modeling infrastructures have been built. The simulation
framework is targeted toward modeling single-threaded coarse-grain out-of-order,
out-of-order, and in-order processors. It is built on top of the Pintool API [31] and
supports an integrated energy model. The energy model consists of several components
that estimate the energy consumption of different hardware components. It utilizes
SPICE, Verilog, and HotSpot [23] simulations to estimate energy numbers. Additionally,
a compiler back-end is built to produce energy-efficient code with an alternative
Instruction Set Architecture (ISA) named the CG-OoO ISA. The new ISA differs
from the x86 ISA in its additional instruction features for supporting block-level code
processing, and in its register file model. The compiler takes optimized code from gcc
and performs additional optimizations to improve the energy efficiency of the code.
This research project revisits most of the out-of-order processor pipeline stages
and devises a design alternative that makes each stage more energy efficient. The
following list highlights the key topics under which these energy-efficient solutions are
developed and then integrated into the CG-OoO processor model:
• A code-clustering compilation framework
• A block-level branch prediction and fetch model
• A novel instruction scheduling model called Skipahead that supports limited
out-of-order instruction issue
• A distributed register file hierarchy. In this model registers are managed through
a static and dynamic register allocation hybrid. The register file hierarchy also
enables building an energy e�cient register rename unit.
• A new re-order bu↵er design that tracks program order and handles squash
events at block granularity
• A distributed and clustered execution model that enables proportional energy-
performance scaling.
The evaluation framework used for this work is in-house. I have built a simulation
software infrastructure based on the Pintool API which consists of three major
components: a detailed energy model; a functional emulator that runs code on the
native processor and instruments it for later use by the third component of this tool;
and the detailed timing simulator itself. Additionally, this project includes a code
processing component that required developing a dedicated compiler framework to
extend the code optimizations and analysis done by gcc. This compiler reformats the
x86 ISA into an alternative ISA designed to support coarse-grain execution (i.e., the
CG-OoO ISA). The timing simulator uses the CG-OoO ISA for performance evaluations.
To my knowledge, no publicly available simulation model with the attributes and
features of this simulator exists in the computer architecture community. Chapter 5
discusses the details of the compiler and the entire simulator.
1.3 Thesis Organization
The remainder of this thesis is organized as follows. Chapter 2 provides background
information on the execution models and the energy-versus-performance properties of
existing processors. It also explains why an alternative execution model is necessary
to bring substantial energy efficiency to single-threaded, general-purpose processors.
Chapter 3 motivates the coarse-grain out-of-order execution model via several
examples, and describes the main processor architecture features at a high level.
Chapter 4 discusses the CG-OoO architecture features in detail and describes how
the processor functions. Chapter 5 discusses the evaluation methodology of this work;
since a majority of the evaluation infrastructure has been built in-house, a great deal
of discussion is aimed at presenting the details of each building block in the compiler,
the simulation infrastructure, and the energy model. Chapter 6 evaluates the
performance and energy characteristics of the coarse-grain out-of-order model and
compares them against the out-of-order and in-order baseline processors. Chapter 7
analyzes previous work in the literature on performance and energy optimizations
for single-threaded processors and highlights how the CG-OoO model differs from
each. Chapter 8 provides concluding remarks on this thesis.
Chapter 2
Background
This chapter introduces two common classes of processors: in-order and out-of-order.
At a high level, it introduces the design elements of the out-of-order processor that
contribute to its superior performance compared to the in-order processor, and
describes the sources of energy inefficiency associated with the out-of-order processor.
2.1 OoO Execution Model
In-order (InO) and out-of-order (OoO) processors are among the popular single-
threaded execution models in the computer architecture community. In the InO
execution model, instructions are simply executed in program order. In the OoO
execution model, however, instructions can be executed out of the original program
order while maintaining program correctness.
Figure 2.1a shows a simple dynamic code sequence consisting of eight instructions.
Figure 2.1b shows the corresponding data-dependency between the instructions where
circles are the instructions and edges are the data-dependency links between the
instructions. In this example, instruction 2 can only be executed after instruction
1 completes its execution. On the other hand, instruction 3 is free to be executed
anytime before or after instruction 1 as it has no data-dependency on 1. The numbers
on the edges indicate the cycle count for operations to generate their results.
In the case of the InO model, the program will be executed according to the
original program sequence as shown in Figure 2.1c; since instruction 1 takes three
cycles to complete its execution, two stall cycles are introduced. The same situation
adds an additional 2-cycle delay after the issue of instruction 3. In the case of the OoO
model, the processor dynamically tracks data dependencies between instructions; it
identifies instruction 3 as independent of instruction 1 and issues 3
at cycle 3 (see Figure 2.1d). Executing 3 early eliminates three stall cycles from the
execution flow, leading to a 23% reduction in execution time.
Scheduling instructions out of order is known as dynamic instruction scheduling.
To enable dynamic scheduling, the OoO processor leverages program speculation
to issue future instructions earlier. For instance, to fetch instruction 1, the OoO
processor first speculates, through control instruction 0, that instruction 1 will very
likely be executed. It also determines that 3 has no data dependency on other
in-flight operations (i.e., instructions 0 and 1). Finally, to hide the latency of
instruction 1, the processor issues instruction 3 in cycle 3. It is through the combination
of dynamic instruction scheduling and speculation that the OoO processor achieves
superior performance compared to the InO processor.
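The qualitative gap between the two schedules can be reproduced with a toy scheduler. The sketch below is illustrative only: it models a 1-wide machine with unlimited execution units, and it uses a small hypothetical five-instruction dependency graph (not the exact instructions of Figure 2.1) in which a 3-cycle producer stalls the in-order machine but not the out-of-order one.

```python
# Toy 1-wide scheduler comparing in-order (InO) and out-of-order (OoO)
# issue for a data-dependency graph. Latencies and edges are illustrative.

def schedule(latency, deps, in_order):
    """Return {instruction: issue_cycle} for a 1-wide machine."""
    issue = {}
    pending = list(latency)        # instructions in program order
    cycle = 1
    while pending:
        picked = None
        for i in pending:
            # An instruction is ready once every producer's result is
            # available: producer issue cycle + producer latency.
            if all(d in issue and issue[d] + latency[d] <= cycle
                   for d in deps.get(i, [])):
                picked = i
                break
            if in_order:
                break              # InO stalls on the oldest instruction
        if picked is not None:
            issue[picked] = cycle
            pending.remove(picked)
        cycle += 1
    return issue

# Hypothetical program: 1 and 3 are 3-cycle loads; 2, 4, and 5 are
# 1-cycle ALU ops; 5 consumes the results of both 2 and 4.
lat  = {1: 3, 2: 1, 3: 3, 4: 1, 5: 1}
deps = {2: [1], 4: [3], 5: [2, 4]}
ino  = schedule(lat, deps, in_order=True)
ooo  = schedule(lat, deps, in_order=False)
# OoO issues the independent load 3 during the stall cycles of load 1.
```

Under these assumed latencies, the out-of-order schedule completes in 7 cycles versus 10 for the in-order schedule, mirroring the kind of stall elimination shown in Figure 2.1.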
McFarlin et al. [34] characterize the benefits of the OoO processor with respect to
the InO processor and conclude that 88% of the OoO performance gain comes from
speculation, 10% from dynamic scheduling, and 2% from improved branch
mis-prediction recovery. Figure 2.2a depicts this performance breakdown for the
SPEC Int 2006 benchmarks. Several fundamental mechanisms underlie the great
performance advantage of the out-of-order processor: effective false-dependency
elimination via register renaming, aggressive control speculation, accurate memory
speculation to enable memory-level parallelism (MLP), and efficient wake-up/select
logic to issue ready instructions. These features are enabled by a number of key
hardware units, including the branch predictor, register renaming tables, load-store
queue unit, memory disambiguation tables, instruction queue or reservation stations,
and the re-order buffer; Figure 2.3 illustrates these units for a 4-wide superscalar
out-of-order processor pipeline. Accesses to tables within these units are the main
source of the energy overhead associated with the OoO processor. Figure 2.2b shows
Figure 2.1: (a) A simple dynamic code sequence example consisting of nine assembly instructions. (b) The data-dependency graph corresponding to this instruction sequence. The color codes separate the data-dependent subsets. The numbers on the edges indicate the number of cycles for each operation to generate its result. The dotted gray arrows show the control flow for control instruction 0. (c) The execution schedule of instructions in a 1-wide in-order processor. (d) The execution schedule of instructions in a 1-wide out-of-order processor. In this example, the out-of-order schedule is three cycles faster than the in-order schedule.
Figure 2.2: (a) The speedup of the out-of-order (OoO) processor compared to the in-order processor, broken into three key categories [34]. (b) Harmonic mean energy per cycle (EPC) of the SPEC Int 2006 benchmarks simulated on the in-order and out-of-order processors. The OoO energy overhead is divided into three main categories.
this energy overhead is broken into three categories: accesses to large tables,
unnecessarily frequent table accesses, and dynamic instruction scheduling, which is primarily
associated with accessing the instruction window tables.
Here, I qualitatively expand on these energy-consuming units. In Chapter 6, their
energy and performance profiles are discussed and quantified in detail.
The branch predictor tables enable program speculation. They allow the processor
front-end to run ahead of the back-end by fetching future instructions early. To
guarantee high back-end performance, the front-end is designed to avoid fetch stall
cycles by predicting the next fetch group¹ every cycle, irrespective of whether the
current fetch group holds a control operation. As will be discussed in later chapters,
the OoO model spends an excessive amount of energy on program speculation, much
of which can be avoided by accessing the branch predictor only at control operations.
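A back-of-the-envelope count illustrates the opportunity. All parameters below are illustrative assumptions (a 4-wide fetch group and 15% control instructions), not measurements from this thesis:

```python
# Toy branch-prediction-unit (BPU) access count: predicting every fetch
# group vs. predicting only at control operations. All parameters are
# illustrative assumptions, not measured values.

instructions = 1_000_000   # assumed dynamic instruction count
fetch_width  = 4           # assumed fetch-group size
branch_ratio = 0.15        # assumed fraction of control instructions

# Conventional front-end: one BPU lookup per fetch group, every cycle,
# whether or not the group contains a branch.
per_cycle_accesses = instructions // fetch_width

# Alternative: access the predictor only for control operations,
# i.e., at most one lookup per branch instruction.
per_branch_accesses = int(instructions * branch_ratio)

savings = 1 - per_branch_accesses / per_cycle_accesses
```

With these assumed parameters the branch-only policy performs 150,000 lookups instead of 250,000, a 40% reduction; the real saving depends on the workload's branch density and the fetch width.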
The register renaming stage is essential in eliminating runtime write-after-write
(WAW) and write-after-read (WAR) dependencies. By eliminating these false dependencies,
register renaming allows significantly higher instruction level parallelism (ILP). The
OoO processor renames every register operand for every instruction; this results
1The group of instructions fetched via an instruction-cache access is called a fetch group. For example, the fetch group of a 4-wide instruction-cache can contain up to 4 instructions.
[Figure 2.3 diagram: Fetch PC and branch prediction feed the L1 instruction cache, followed by decode, rename, and dispatch into the ROB and instruction window; the scheduler issues from the window to the register file, four EUs, and the LSU, which accesses the L1 data cache backed by the L2 cache.]
Figure 2.3: The pipeline structure for a 4-wide out-of-order execution model. LSU refers to the load-store unit, EU refers to the execution unit, and ROB refers to the re-order buffer.
in a significant energy overhead. In future chapters, I show that for some register operands, register renaming can be eliminated to reduce the energy overhead of renaming table lookups.
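As an illustrative sketch (not the renaming hardware modeled in this work), the following shows the core idea of a rename map: every destination is given a fresh physical register, which removes WAW and WAR hazards. All names here are hypothetical; a real renamer also manages free-list reclamation and branch checkpoints.

```python
# Hypothetical sketch of a register-rename map: every destination gets a
# fresh physical register, eliminating WAW/WAR dependences between writes
# to the same architectural register.

class Renamer:
    def __init__(self, num_phys):
        self.map = {}                      # architectural -> physical mapping
        self.free = list(range(num_phys))  # free physical registers

    def rename(self, dst, srcs):
        srcs_p = [self.map.get(s, s) for s in srcs]  # read current mappings
        dst_p = self.free.pop(0)                     # fresh tag kills WAW/WAR
        self.map[dst] = dst_p
        return dst_p, srcs_p

r = Renamer(64)
d1, _ = r.rename('r1', ['r2', 'r3'])  # first write to r1
d2, _ = r.rename('r1', ['r4', 'r5'])  # second write to r1 (WAW removed)
assert d1 != d2                       # the two writes no longer conflict
```

Because each write to `r1` receives a distinct physical tag, a later writer never has to wait for an earlier writer or reader of the same architectural register.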
The register file is one of the most commonly accessed tables in the processor.
Each instruction accesses the register file 2 or 3 times depending on the number of its
operands.2 Also, superscalar processors issue multiple instructions per cycle requiring
a large number of ports to access data for all instructions.3 In addition, the register
file is usually 4x larger than the number of architectural registers in order to support
register renaming. The larger the size of the register file and the number of ports, the higher the access energy. In future chapters, an energy-efficient register file hierarchy
model is presented to reduce the register file energy.
The instruction scheduler is the major energy consuming component of the core
pipeline. OoO instruction scheduling consists of two main steps: instruction wakeup
and instruction select. Upon the completion of every instruction, its result is written into the register file and forwarded to the operations waiting for it. Upon the
availability of all the source operands of an instruction, it is woken up for issue.
Each cycle, the dynamic instruction scheduler selects the n oldest ready (i.e., woken-up) instructions from the instruction queue and issues them to the available execution units (EU); here, n refers to the number of available EU's in that cycle. The
instruction scheduler is a unified queue with a random access memory (RAM) structure that holds the static instruction information, such as the opcode and immediate value, and two content-addressable memories (CAM) that hold the source operands for each operation; the CAM tables allow the wakeup unit to search and update the
source operands of waiting instructions. In future chapters, an energy-efficient and complexity-effective instruction scheduling model is introduced.
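The wakeup/select mechanism described above can be sketched in a few lines. This is a behavioral illustration only (entry fields and tags are invented for the example, and the CAM search is modeled as a set operation): completing instructions broadcast their destination tags, and each cycle the n oldest ready entries are selected.

```python
# Hypothetical sketch of OoO wakeup/select: completing instructions broadcast
# their destination tags (the CAM match), and each cycle the n oldest ready
# entries are selected for issue.

def wakeup(window, completed_tags):
    for entry in window:
        entry['waiting'] -= set(completed_tags)  # CAM match clears sources

def select(window, n):
    ready = [e for e in window if not e['waiting']]
    ready.sort(key=lambda e: e['age'])           # oldest-first select
    issued = ready[:n]
    for e in issued:
        window.remove(e)
    return [e['tag'] for e in issued]

window = [
    {'tag': 'i1', 'age': 0, 'waiting': set()},
    {'tag': 'i2', 'age': 1, 'waiting': {'i1'}},  # i2 waits on i1's result
    {'tag': 'i3', 'age': 2, 'waiting': set()},
]
assert select(window, 2) == ['i1', 'i3']  # i2 is skipped: not yet woken up
wakeup(window, ['i1'])                    # i1 completes and broadcasts its tag
assert select(window, 2) == ['i2']
```

Note that every broadcast touches every window entry; this all-entries search is precisely the CAM cost that the text identifies as a major source of scheduling energy.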
The re-order buffer (ROB) enables precise exceptions and maintains program order by enforcing in-order instruction commit. The ROB must be accessed by every
dynamic instruction at least three times; once at the rename stage to reserve a ROB
entry, once when the operation completes execution, and once when the operation
2For example, in the MIPS ISA, instructions have at most two source operands and one destination operand.
3In 4-wide OoO processors, building register files with 8 read ports and 4 write ports is common.
is to be committed.4 In future chapters, I discuss an alternative design structure for the ROB that supports program order and precise exceptions without the need to access it as frequently as the OoO model does; reducing the ROB access frequency reduces its energy consumption.
The goal of the rest of this study is to identify the contribution of different tables to the OoO energy overhead and to devise compiler techniques as well as architectural modifications to reduce these energy overheads. As will be discussed in future chapters, reducing these energy overheads demands introducing an alternative execution
model compared to that of the OoO processor. It is expected that the new execution
model maintains support for dynamic execution and speculation, but proposes design
solutions for reducing the energy cost.
2.2 Chapter Summary
This chapter introduced the fundamental architectural elements of the OoO processor that contribute to its performance efficiency, namely dynamic instruction scheduling and program speculation. It described how these features are enabled by the branch predictor, register renamer, dynamic instruction scheduler, and re-order buffer, and explained that these same units consume the majority of the OoO energy overhead. The goal of this work is to devise compiler and architectural solutions that substantially reduce the energy consumption of these units while maintaining the same level of performance as the OoO model.
4Here, I assume the ROB does not hold intermediate register data (i.e., using a physical register file).
Chapter 3
Coarse-Grain Out-of-Order
Execution Model
In this chapter, I introduce an energy-efficient and high-performance execution model
named Coarse-Grain Out-of-Order (CG-OoO) execution and describe it through a
number of examples. I also describe the sources of energy saving in CG-OoO. This
chapter builds the foundation for detailed discussions on the processor architecture
and its performance analysis in Chapters 4 and 6.
3.1 Coarse-Grain Out-of-Order Execution
I present an energy-efficient and high-performance single-threaded core architecture for general-purpose computing named Coarse-Grain Out-of-Order (CG-OoO). The key insight behind building this architecture is block-level dynamic execution. Block-level execution has been previously studied in various contexts [19, 35]. These studies are elaborated on further in Chapter 7.
In this framework, a block is defined as a sequence of static instructions clustered together. Each code block, in this study, is a control-flow basic-block. Block-level dynamic execution means the branch prediction, dispatch, instruction scheduler, operand write-back, commit, and squash units are designed to handle blocks as the primary unit of execution (rather than instructions). In this model, the processor
CHAPTER 3. COARSE-GRAIN OUT-OF-ORDER EXECUTION MODEL 13
speculatively fetches code blocks into the execution pipeline, and allocates to each
block a separate first-in-first-out (FIFO) instruction queue called a Block Window
(BW) (Figure 3.1). The instruction scheduler checks the head of each BW in a
round-robin manner to find ready instructions to issue. Once all instructions in a
code block complete execution, the block is ready to retire.
This architecture motivates a new design model that is substantially more energy efficient, and that sacrifices negligible performance compared to the OoO core discussed in Chapter 2. The following list summarizes the high-level design techniques used in CG-OoO to save energy. It also outlines solutions to the list of energy inefficiency drawbacks of the OoO design discussed in Chapter 2. The remainder of
this chapter and the future chapters focus on explaining the following techniques and
evaluating their impact on the overall energy and performance of CG-OoO.
• Small Tables: CG-OoO replaces large and centralized hardware structures such as the instruction queue, register file, and re-order buffer (ROB) with smaller, less complex, and distributed structures. For instance, as shown in Figure 3.1, the OoO core Instruction Window is replaced with the BW's as decentralized FIFO buffers that hold block instructions. CG-OoO replaces the conventional Re-Order Buffer (ROB) with a 10× smaller table called the Block Re-Order Buffer (BROB) that tracks code blocks (see Table 5.2). CG-OoO also uses a novel decentralized register file hierarchy that is discussed in detail in Chapter 4. By reducing table sizes, the energy to access these tables is reduced.
• Hybrid Instruction Scheduling: CG-OoO combines static and dynamic instruction scheduling to reduce hardware energy consumption by cutting the runtime instruction scheduling overhead. The compiler generates optimized static list schedules for instructions within each block. As illustrated in Figure 3.1, the dynamic scheduler scans the heads of its BW's to find and issue ready instructions.
• Reduced Table Accesses: CG-OoO reduces accesses to energy-hungry hardware units such as the register renamer and branch predictor. Compiler support
Figure 3.1: Block-level dynamic execution model. BW stands for Block Window, and EU stands for Execution Unit. A BW holds operations that belong to a block of code. EU's and BW's are grouped together to form an execution cluster; each execution cluster is managed by a separate scheduler. Each instruction scheduler checks the head of its BW's to issue ready instructions to its EU's.
along with an energy-effective register file hierarchy helps bypass renaming the register operands that have short live-ranges; this technique is discussed in detail in Chapter 4. Compiler support also helps eliminate branch prediction lookups by non-control operations. Section 3.3.2 describes how the CG-OoO core minimizes the branch prediction lookup traffic.
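The block-level issue policy behind these techniques can be sketched compactly. The following is an illustrative model only (queue contents, readiness, and EU counts are invented for the example): each BW is a FIFO, and a round-robin scheduler examines only the BW heads, issuing at most one ready instruction per BW per cycle up to the number of EUs.

```python
# Hypothetical sketch of CG-OoO block-level issue: each Block Window (BW) is
# a FIFO, and a round-robin scheduler examines only BW heads, issuing at most
# one ready instruction per BW per cycle (bounded by the number of EUs).

from collections import deque

def issue_cycle(bws, ready, num_eus):
    """bws: list of deques of instruction names; ready: set of ready names."""
    issued = []
    for bw in bws:                    # round-robin over block windows
        if len(issued) == num_eus:
            break
        if bw and bw[0] in ready:     # only the head of each FIFO is visible
            issued.append(bw.popleft())
    return issued

bw0 = deque(['add', 'sll', 'lw'])
bw1 = deque(['lw2', 'bne2'])
assert issue_cycle([bw0, bw1], ready={'add', 'lw2'}, num_eus=2) == ['add', 'lw2']
assert issue_cycle([bw0, bw1], ready={'lw2'}, num_eus=2) == []  # both heads stalled
```

Because the scheduler inspects only one entry per BW, it avoids the all-entries CAM search of a unified instruction window; the trade-off is that a stalled head blocks the rest of its block, which Section 3.5 addresses with static list scheduling.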
3.2 Constructing Blocks for CG-OoO
In the CG-OoO model, dynamic scheduling is done at code block granularity (rather
than instruction granularity) where each code block consists of a group of instructions
with an optimized static schedule. Block-level dynamic scheduling requires special
support from the compiler to cluster instructions. In this section I explain how the
compiler clusters instructions into code blocks. Recall that each code block, in this
work, is a control-flow basic-block.
3.2.1 Block Boundary Annotation
CG-OoO requires a means to identify blocks of instructions. The compiler generates
a special instruction called head to specify the start of each code block. Upon each
head fetch, the front-end allocates an available BW for the new block of code and
groups instructions into code blocks by steering all upcoming instructions to the BW.
Figure 3.2 shows the head instruction format. Its fields are:
• Opcode: 6-bit opcode value
• HasCtrl: 1-bit value indicating if the code block holds a control instruction
• BlkSize: 5-bit value indicating the number of operations in the block
• Immediate: 52 least significant PC bits of the control instruction in each code
block
Figure 3.3 shows an example code sequence highlighting head: the number of instructions, excluding head, is six; HasCtrl = 0'b1 indicates that a control operation
[Figure 3.2 diagram: the 64-bit head instruction word, with Opcode in bits 63–58, HasCtrl in bit 57, BlkSize in bits 56–52, and the Fall-Through Block Offset in bits 51–0.]
Figure 3.2: The head instruction format. HasCtrl is a 1-bit value indicating if a valid value exists in the immediate field of the instruction. BlkSize holds the number of static instructions in the block, excluding head. The immediate field is the least significant 52 bits of the control operation address at the end of the block. The immediate is used to hash into the BPU tables.
exists at the end of the block, and the immediate holds the least significant 52 bits of the control operation address (i.e., that of bne).
If a block does not end with a control instruction (i.e., it has only a fall-through path), HasCtrl is set to 0'b0 to disable the Branch Prediction Unit (BPU) lookup and instead continue fetching the fall-through block, which is stored immediately after this block.
The head instruction serves a few key purposes:
1. It specifies the boundaries of code blocks to the processor at runtime.
2. It holds the number of block instructions in BlkSize. As will be discussed in
Section 3.3, this number is used to track when all instructions of the block complete their execution, making the block ready to retire. Five bits are allocated to BlkSize because the compiler assumes the largest code block can be 32 instructions. The compiler breaks larger blocks into multiple smaller blocks, each with at most 32 operations. Bird et al. [5] show that the average sizes of basic-blocks in the SPEC CPU 2006 integer and floating-point benchmarks are 5 and 17 operations respectively, making 32 instructions per block a sufficiently large size.
3. The Immediate field is used to hash into the BPU tables during the prediction
stage to predict the next code block. The motivation behind looking up the
BPU using head rather than control operations is described in Section 3.3.
int index = -1;
do {
    index += 1;
    value_array[index] -= 1;
} while (index != MAX_ITER);

LOOP:
0xF00  head  0b'1, 0x6, 0x38
0xF08  add   r3, r3, #1
0xF10  sll   r0, r3, #3
0xF18  lw    r1, r0
0xF20  sub   r1, r1, #1
0xF28  sw    r0, r1
0xF30  bne   r2, r3, LOOP
Figure 3.3: A simple do-while loop that updates the values of an array (left). The assembly version of the program (right) shows the head instruction as the first instruction in the basic-block.
3.2.2 Code Generation
To build the CG-OoO processor architecture, two code generation approaches are
possible. The first approach is to embed code blocking semantics into the program
binary during the static compilation process. The second approach is to construct
code blocks through runtime dynamic code optimization.
In the case of static code optimization, which to date has been the common method, the addition of head to any given Instruction Set Architecture (ISA) is the only necessary change; this may be an acceptable change for an energy-aware architecture given the amount of energy-saving opportunity it can deliver.
In the case of hardware-level dynamic code optimization, for architectures like the NVIDIA Project Denver [1, 6, 12, 15], no compiler-level ISA modification is required; this is because code block detection and annotation can be done dynamically
during the dynamic program profiling stage with almost no extra data-collection cost
as the hardware profiler simply annotates blocks using the information it already
collects on control operations. Such processors dynamically post-process the original
program ISA (e.g. ARM or X86) into a low-level micro-code ISA; the micro-codes
are scheduled into dynamic code sequences that execute with substantially higher
performance and energy efficiency than conventional OoO processors.
The choice between the above two alternatives, in practice, depends on the requirements and constraints of a particular processor architecture design. In this work, I use the former alternative.
3.3 Execution Model
This section discusses the execution model of CG-OoO when instructions are grouped
into basic-blocks.
Figure 3.4.A shows the control flow graph of a simple do-while loop that walks through an array, values, to find the first occurrence of a specific value, ELEM_VALU.
At runtime, the loop is unrolled many times to expose the instruction and data-level
parallelism of the loop to the hardware.
Figure 3.4.B illustrates the processor pipeline stages. The highlighted stages are the key differences with respect to the conventional OoO processor model. These
stages are described in the example provided in Section 3.3.1.
Figure 3.4.C shows the high-level model of a two-wide CG-OoO core. The CG-OoO processor front-end unrolls multiple loop iterations of the program by speculatively fetching, decoding, and register renaming instructions. In parallel with the instruction renaming stage, upon reading a head operation in the fetch sequence, the Block Allocator unit allocates an available Block Window (BW) to store the upcoming instructions; in this figure, the two Block Windows are marked BW0 and BW1. It also reserves a new block entry in the Block Re-Order Buffer (BROB), which is used to maintain program order at block-level granularity. The Instruction Steer stage dispatches upcoming instructions to the appropriate BW. The Instruction Scheduler unit visits the head of each BW to issue ready instructions to an available execution unit (EU). Once instructions complete execution, their results are written back into a register or a store-queue entry, and at the same time, an energy-effective wakeup unit updates each BW with the most recent changes in the program context. Finally, the commit stage retires a code block once (a) it reaches the head of the BROB, and (b) all its operations complete the write-back stage.
3.3.1 Program Execution Flow in CG-OoO
Figure 3.5 is a cycle-by-cycle example of the instruction flow through the CG-OoO pipeline stages. This figure considers two consecutive iterations of the abovementioned do-while loop. Similar to Figure 3.4.C, the processor model in this example
[Figure 3.4 content. Source loop:
    int index = -1;
    do {
        index += 1;
    } while (values[index] != ELEM_VALU);
Assembly for one loop block:
    HEAD:
    0xF00  head  0b'1, 0x4, 0x28
    0xF08  add   r3, r3, #1
    0xF10  sll   r0, r3, #3
    0xF18  lw    r1, r0
    0xF20  bne   r2, r1, HEAD
Pipeline stages (panel B): BLOCK PREDICTION, FETCH, DECODE, RENAME, BLOCK ALLOCATION, INSTRUCTION STEER, EXECUTE, WRITE BACK, BLOCK COMMIT.]
Figure 3.4: (A) control flow graph of a simple do-while loop, (B) the pipeline stages of CG-OoO; the highlighted blocks show the key differences with respect to the OoO core, (C) the high-level execution flow stages in the CG-OoO model with an issue width of 2 instructions / cycle, one per execution unit (EU). Each BW, in this architecture example, is assumed to have two write and one read ports. Also, it has one execution cluster (see Figure 3.1 for details).
[Figure 3.5 table: cycle-by-cycle contents of the BR PRED, FETCH, DECODE, RENAME, DISPATCH, EXECUTE, WB, and COMMIT stages, and of BW0, BW1, and the BROB, over cycles 1–16.]
Figure 3.5: Instruction flow diagram for the example loop provided in Figure 3.4.A. This figure shows the flow of instructions in two consecutive iterations of the loop. Instructions in the first and second iterations are colored green and red respectively. The green table (right) shows the contents of BW0, BW1, and BROB at each cycle. head1 and head2 correspond to the BROB entries for the two loop iterations. WB stands for the write-back stage.
is a two-wide superscalar machine with the ability to issue one instruction per BW
per cycle to the two EU’s. Instructions in the first and second iterations are colored
green and red respectively.
In this example, all instructions but lw are assumed to be single-cycle operations, while lw is a four-cycle operation. As a result, because bne is data-dependent on lw, its issue is delayed by four cycles, until the load value is returned.
In cycle 1, {head, add} instructions are fetched from the instruction cache. The
immediate value in head is used to predict the next code block in cycle 2 after it
is fetched. Notice head speculates the next code block before the control operation,
bne, is fetched on cycle 3. head flows through the pipeline stages until it reaches
the Rename stage in cycle 3 at which point the Block Allocator assigns BW0 to the
instructions following head. At the same time, head reserves an entry in the Block
Re-Order Bu↵er (BROB) where it holds the status of the code block as its instructions
make progress through the pipeline (see head1 in the BROB). head1 is available to be
retired as soon as all instructions in the associated block complete their execution and write back their results to either a register or a store-queue entry. The same sequence
of events applies to the next loop iteration. The first block retires in cycle 13 and the
second one retires in cycle 16. The block speculation model in CG-OoO makes all
computations speculative. Upon retiring a block, all registers and store-queue data
generated by the block operations will be marked non-speculative.
The green table in Figure 3.5 shows the contents of BW0, BW1, and BROB over
time. For example, BW0 receives its first instruction in cycle 4 which is immediately
issued in cycle 5; more instructions from the same code block join BW0 in cycle 5.
The last instruction of the first loop iteration leaves BW0 in cycle 10. BW0 and BW1 become
available to hold new code blocks in cycles 11 and 14 respectively. In cycles 12 and
13, no instruction is available to be issued because BW0 is empty and BW1 is stalling
on the lw instruction.
3.3.2 Control Speculation in CG-OoO
Control speculation allows processors to initiate the fetch of future instructions before
having completed the execution of current control instructions in the pipeline. Control
speculation improves processor front-end performance by avoiding fetch stall cycles.
The Branch Prediction Unit (BPU) is in charge of this task. OoO processors perform
BPU lookups immediately before every instruction fetch to avoid fetch stall cycles
irrespective of the instruction types in the fetch group [48]; this leads to excessive energy cost for control speculation and redundant BPU lookup traffic from non-control instructions, which in turn may lower prediction accuracy due to aliasing [37].
As pointed out earlier, in CG-OoO, head is the only instruction used to access the BPU. Since head is usually ahead of its branch operation by at least one cycle, the probability of fetch stall cycles is often low. This probability, however, depends on the fetch-width of the processor and the common size of code blocks in the application program. For example, in Figure 3.6, when the fetch-width is 2, head and bne are fetched in two consecutive cycles. In contrast, when the fetch-width is 4, head and bne are fetched in the same cycle, which in turn causes a fetch stall cycle due to delayed prediction. As a result, the two processors have equal front-end performance. In Chapter 6, the effect of fetch stalls due to delayed branch prediction lookup is evaluated.
Unlike BPU lookups by head operations, updates to either the branch predictor
or branch target buffer (BTB) during the WB stage use the control instructions.
[Figure 3.6 content. Loop:
    LOOP:
    0xF00  head  0b'1, 0x6, 0xF18
    0xF08  add   r3, r7, #1
    0xF10  sll   r0, r3, #3
    0xF18  bne   r2, r0, LOOP
Two cycle-by-cycle tables show the BR PRED, FETCH, and DECODE stage contents over cycles 1–5 for fetch-widths of 2 and 4.]
Figure 3.6: A simple loop (left). Font colors represent two consecutive iterations of the loop. The cycle-by-cycle instruction flow through the BPU, FETCH, DECODE pipeline stages (right); the top and bottom tables assume a CG-OoO processor with fetch-width = 2 and fetch-width = 4 respectively.
3.4 Squash Model
CG-OoO supports control and memory speculation. The squash process for the two cases is slightly different. Here, the two squash models are discussed separately.
3.4.1 Squash due to Control Mis-speculation
Squash events are handled at block-level granularity. Upon detecting a branch mis-prediction, when the branch is in the execution pipeline stage, the front-end stalls fetching new instructions, all code blocks younger than the mis-speculated control operation are flushed from the pipeline, and the remaining code blocks are retired. Once the BROB is empty, the processor state is non-speculative and the effects of wrong-path operations are discarded. At this stage, the processor can safely resume normal execution by fetching new blocks from the instruction cache. For example, in Figure 3.7, cycle 14 is when the processor can restart fetch.
Figure 3.7 shows the squash process through an example that illustrates the same code sequence as Figure 3.5, except that in this case the second iteration is assumed to be mis-predicted by head in cycle 2. In cycle 11, bne is executed,
[Figure 3.7 table: the same cycle-by-cycle pipeline and BW/BROB contents as Figure 3.5, with the second iteration's entries struck through after the mis-speculation is detected.]
Figure 3.7: Instruction flow diagram for the example loop provided in Figure 3.4, shown in the case of a squash event. This figure shows the flow of instructions in the final iteration of the loop. Instructions in the two iterations are colored green and red respectively. Due to a branch mis-speculation, the final loop iteration is squashed, and the entries with a strikethrough are never executed. The gray box is the time at which the branch mis-speculation is detected.
and in cycle 12 its result is compared against the speculated block program-counter
(BPC). In case of a conflict, a squash flag is raised to flush all mis-predicted, in-
flight operations from the second loop iteration and to remove the head2 entry from
BROB. The squash event also cancels all future activity by younger instructions (see
the strikethrough operations in Figure 3.7).
As noted earlier, in the write-back stage, instructions write their speculative results into a register (or a store-queue) entry. These results are marked non-speculative only after their corresponding block retires. Thus, through the above execution process, the data produced by wrong-path blocks are automatically discarded, as such blocks never retire. The values produced by add and sll in the WB stage in cycles 9 and 10 of Figure 3.7 are discarded.
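The block-granularity flush can be sketched as a partition of the BROB. The snippet below is an illustrative model (block names and the list representation are invented): on a mis-prediction inside a given block, every BROB entry younger than that block is flushed, and only blocks that eventually retire make their results architectural.

```python
# Hypothetical sketch of block-granularity squash: on a branch mis-prediction
# inside block `bad_block`, all BROB entries younger than that block are
# flushed; the block holding the branch, and all older blocks, still retire.

def squash_younger(brob, bad_block):
    """brob: list of block ids in program order (oldest first)."""
    idx = brob.index(bad_block)
    survivors = brob[:idx + 1]   # mis-predicting block and older blocks retire
    flushed = brob[idx + 1:]     # younger (wrong-path) blocks are discarded
    return survivors, flushed

brob = ['head1', 'head2', 'head3']
survivors, flushed = squash_younger(brob, 'head1')  # mis-predict inside head1
assert survivors == ['head1']
assert flushed == ['head2', 'head3']
```

Because wrong-path blocks never retire, their register and store-queue results are never marked non-speculative, which matches the discard behavior described above.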
3.4.2 Squash due to Memory Mis-speculation
Similar to the conventional OoO processors, the memory interface in CG-OoO consists
of a load-store-queue (LSQ) that operates at instruction granularity. The memory
mis-speculation detection follows the conventional model where once a store operation
detects a conflict with a younger load operation, a squash event is triggered. Handling
memory mis-prediction events, however, differs from branch mis-speculation in that the squash process is initiated at the start of the block holding the mis-predicted lw operation, meaning all younger blocks, including the block the lw is part of, are flushed. As a result, the older instructions that coexist in the same block as the lw are also squashed. Since those instructions are older than the lw, they would not have been flushed in the OoO model. The flush of useful operations is called wasted computation. To reduce wasted computation, the compiler schedules lw operations as close as possible to the top of the code block.
3.5 Static Instruction Scheduling
As shown in Figure 3.4, the Instruction Scheduler unit visits the head of each BW to find ready instructions to issue. The limited view of the dynamic scheduler into a BW can prevent it from finding ready instructions that are blocked behind long-latency, memory-dependent operations at the head of a BW. This problem can be mitigated if each code block holds an optimized code sequence that avoids poor dynamic scheduling due to head-of-queue stalls.
In this work, I use static instruction list scheduling on each code block to enable
significant improvements in the processor performance (a) by optimizing the static
schedule along the critical path instructions in each block, (b) by improving memory
level parallelism via hoisting memory operations as close as possible to the top of
their code block, and (c) by minimizing wasted computation due to memory mis-
speculation. The impact of instruction scheduling on performance of CG-OoO is
discussed in detail in Chapter 6.
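A minimal sketch of such a per-block list scheduler is shown below. This is not the compiler pass used in this work; it is an illustrative heuristic (the instruction names, kinds, and dependence sets are invented, and dependences are assumed acyclic) that prioritizes loads among ready instructions so they are hoisted toward the block head.

```python
# Hypothetical sketch of per-block static list scheduling that prioritizes
# loads (hoisting them toward the block head for MLP) while honoring data
# dependences. A real compiler would also weigh critical-path length.

def list_schedule(instrs, deps):
    """instrs: {name: kind}; deps: {name: set of names it depends on}.
    Assumes the dependence graph is acyclic."""
    scheduled, remaining = [], dict(instrs)
    while remaining:
        ready = [i for i in remaining if deps.get(i, set()) <= set(scheduled)]
        # prefer loads among ready instructions; otherwise keep program order
        ready.sort(key=lambda i: 0 if remaining[i] == 'load' else 1)
        pick = ready[0]
        scheduled.append(pick)
        del remaining[pick]
    return scheduled

instrs = {'add': 'alu', 'lw': 'load', 'sub': 'alu'}
deps = {'sub': {'add'}}          # sub consumes add's result; lw is independent
assert list_schedule(instrs, deps) == ['lw', 'add', 'sub']
```

The independent load is moved ahead of the ALU chain, so at runtime it reaches the BW head (and hence the memory system) as early as possible, which also shrinks the wasted-computation window on a memory mis-speculation.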
3.6 Sources of Parallelism in CG-OoO
CG-OoO benefits from a hybrid of static and dynamic parallelism opportunities.
Here, I discuss the sources of these opportunities.
Memory Level Parallelism (MLP)
Since memory operations from different BW's can be issued in parallel, CG-OoO supports memory level parallelism. Figure 3.5 shows MLP in cycle 10, when the two lw operations are in flight at once. MLP is especially effective during cache-miss events. To further improve MLP, the compiler statically hoists memory operations toward the head of their block to help the dynamic scheduler issue them earlier.
Block-Level Parallelism (BLP)
Block-level out-of-order execution manifests itself in the form of having multiple BW’s
issuing instructions to hide each other's head-of-queue stall latency. For instance, in Figure 3.7, in cycles 8, 9, and 10, instructions from BW1 hide the latency of the lw from BW0.
Instruction Level Parallelism (ILP)
ILP in the context of CG-OoO execution refers to instruction issue parallelism within
a code block. This type of parallelism is not presented in this chapter and is not included in Figures 3.1 and 3.4. In Chapter 4, I elaborate on two energy-effective techniques that improve CG-OoO performance by allowing instruction-level parallelism. The simplest model for instruction-level parallelism is the case where more than one instruction can be issued from each BW per cycle, in order. For instance, if two consecutive instructions at the head of a BW were ready to issue, the scheduler would issue them both. The more involved model allows limited bypass across head-of-queue stall instructions in order to find stall-independent instructions. Both techniques are designed such that they avoid the energy cost and complexity of dynamic OoO scheduling.
3.7 Chapter Summary
I presented the CG-OoO processor execution model through an example that illustrates how instructions flow through the pipeline, how the squash unit rolls back execution, and how static and dynamic instruction scheduling cooperate to provide high-performance dynamic execution. I also described the sources of energy saving in the CG-OoO processor.
Next, I discuss the architectural details of the CG-OoO processor and how each design element contributes to its overall performance and energy behavior relative to the OoO processor.
Chapter 4
System Architecture
In this chapter, I discuss the architectural features of the CG-OoO processor model and elaborate on how different design decisions contribute to saving energy in each stage.
4.1 System Architecture
All major architectural units in the CG-OoO processor are presented in Figure 4.1. The front-end consists of the Fetch, Decode, Register Rename, Block Allocation, and Instruction Steer units. It processes instructions in-order and clusters them into code blocks that are processed by the back-end. The back-end consists of the Block Windows (BW), Instruction Scheduler, Execution Units, Load-Store Unit (LSU), and Block Re-Order Buffer (BROB). This chapter provides details on each of these units, describes how they communicate with each other, and highlights the design techniques that enable energy saving.
4.2 Pipeline Stages
This section focuses on presenting the micro-architectural details of each stage and
the communication patterns between stages presented in Figure 4.1.
[Figure 4.1 diagram: the front-end (Branch Prediction Unit, 32KB L1 instruction cache, Fetch, Instruction Decode, Register Rename, Block Allocation, Instruction Steer) feeds a back-end of Block Windows (BW), the Instruction Scheduler, Execution Units (EU), the LSU, and the B-ROB, backed by a 32KB L1 data cache and a 256KB L2 cache.]
Figure 4.1: Detailed micro-architecture of the CG-OoO processor. The highlighted blocks are the key differentiators of this processor compared to the OoO processor. The register file hierarchy is encapsulated in the BW modules. See Figure 4.11 for details.
4.2.1 Branch Prediction
Speculative execution enables latency hiding through accurate branch prediction.
This allows the processor front-end to speculatively run ahead of the current program
execution stage and provide dynamic code for the processor back-end to run during
unpredictable, long-latency cache-miss events. As will be shown later in this chapter,
to save speculation energy, the Branch Prediction Unit (BPU) in the CG-OoO processor
limits speculation lookups to one per dynamic code block; this is significantly
more energy efficient than the OoO processor, where a BPU lookup is done at
every instruction fetch event. Chapter 6 quantifies the energy benefits of block-level
speculation.
Figure 4.2 shows the micro-architectural details of the branch prediction stage in
the CG-OoO processor; it consists of the 2Bc-gskew Branch Predictor (BP) [48], the
Branch Target Buffer (BTB), the Return Address Stack (RAS), and the Next Block-PC
block. The next code block PC is computed through Equation 4.1.
PC_next-head = PC_head + fall-through-block-offset    (4.1)
The fall-through-block-offset is held as an immediate field of the head
instruction, as previously shown in Figure 3.2. This value is generated at compile time
to help compute the next head PC.
In contrast with the conventional BPU access approach where every fetch group
PC would be used to access the BPU, in the CG-OoO model, only head PC’s access
the BPU. Upon lookup, a head PC is used to predict the next head PC. Speculated
PC's are pushed into a FIFO queue named the Block PC Buffer, dedicated to
communicating block addresses to the fetch unit. Section 4.2.2.1 presents a detailed
processor front-end runtime example in which the branch predictor behavior is
elaborated.
When control operations complete execution, they verify the prediction correctness;
if the prediction was correct, the corresponding BPU entry(ies) would be reinforced,
and if it was incorrect, it would be reversed. However, since in the prediction
phase the BPU is indexed by head PC's, it is necessary that BPU updates also index
[Figure 4.2 diagram: the head PC indexes the BPU (2Bc-gskew components BIM, G-Share, META, plus the BTB and RAS); the Next Block PC unit combines predictions with the fall-through-block-offset to push next fetch-block addresses into the Block PC Buffer, which feeds instruction cache access and the decode stage; exception and branch-misprediction redirects supply the next fetch address.]
Figure 4.2: Branch Prediction Unit (BPU) micro-architecture. Next Block PC computes the next fall-through block PC. The Block PC Buffer holds the outstanding block addresses the Fetch Stage must consume. In this model, each head PC is used to predict the next head PC. Conditional operations within each block would be used to confirm predictions and, if needed, update table entries.
the table using the same PC’s. To do so, control operations access their corresponding
BPU entry(ies) by first computing their head PC using Equation 4.2.
PC_head = PC_control-op − code-block-offset    (4.2)
The remainder of the branch prediction update process is similar to conventional
CPU models; the global prediction history queue is speculatively updated at
prediction time, and the BP, BTB, and RAS table entries are updated once control
instructions complete execution.
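Equations 4.1 and 4.2 reduce to an add and a subtract on PC values. The sketch below (with hypothetical argument names and byte-valued offsets) shows how prediction and update index the BPU with the same head PC:

```python
def next_head_pc(head_pc, fall_through_block_offset):
    # Equation 4.1: the head's fall-through-block-offset immediate gives
    # the distance from this head to the fall-through block's head.
    return head_pc + fall_through_block_offset

def head_pc_for_update(control_op_pc, code_block_offset):
    # Equation 4.2: a completing control op subtracts its offset within
    # the block to recover the head PC that indexed the BPU at
    # prediction time, so the update hits the same entry.
    return control_op_pc - code_block_offset
```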
Upon a squash event, conventional branch predictors would undo the global
history queue, the only speculatively updated unit in the BPU. In the CG-OoO
architecture, the squash protocol also flushes the Block PC Buffer.
By only accessing the BPU through head operations, the CG-OoO branch predictor
is 53% more energy efficient than that of the OoO model (see Figure 6.12). Recall
from Chapter 2 that the OoO model accesses the BPU on every fetch group.
Chapter 6 evaluates the energy and performance characteristics of the CG-OoO branch
predictor.
4.2.2 Fetch Stage
The Fetch stage accesses the instruction cache to load future instructions for the
processor back-end to execute. Depending on the processor width, Fetch may load
a different number of instructions per cycle (e.g. 2, 4, or 8 instructions). The CG-OoO
Fetch stage loads code blocks from the instruction cache after receiving code block
addresses from the BPU. Code block addresses are pushed to the Block PC Buffer by
the branch predictor and popped by the fetch front-end. When the Block PC Buffer is
full, the BPU stalls and waits for fetch to drain the buffer. When this buffer is empty,
the fetch unit stalls and waits for the BPU to produce new code block addresses.
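This producer/consumer relationship between the BPU and Fetch can be modeled as a bounded FIFO. The class below is an illustrative sketch; the capacity and method names are my own assumptions, not the hardware interface:

```python
from collections import deque

class BlockPCBuffer:
    """Bounded FIFO between the BPU (producer) and Fetch (consumer).
    A full buffer stalls the BPU; an empty one stalls Fetch."""
    def __init__(self, capacity=4):
        self.q, self.capacity = deque(), capacity

    def bpu_push(self, block_pc):
        """Returns False when full, signaling a BPU stall."""
        if len(self.q) == self.capacity:
            return False
        self.q.append(block_pc)
        return True

    def fetch_pop(self):
        """Returns None when empty, signaling a Fetch stall."""
        return self.q.popleft() if self.q else None

    def flush(self):
        """Squash protocol: discard all outstanding block addresses."""
        self.q.clear()
```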
Figure 4.3A illustrates a control flow graph with five basic-blocks. Each block is
marked with its head identifier, h, at the top, and its control operation identifier (if
any), c, at the bottom. Figure 4.3B illustrates the mapping of these basic-blocks to
the instruction cache, where each rectangle represents a 64-bit instruction and each
group of four adjacent instructions with the same color shade represents a fetch group.¹
Instruction entries marked I show the mapping of non-control, non-head operations
within the basic-blocks shown in Figure 4.3A.
In order to correctly fetch code blocks, the fetch unit supports three block fetch
scenarios:
1. Fetch-group with zero head instructions
2. Fetch-group with one head instruction
¹This is assuming that the instruction cache does not support fetch-alignment [32].
3. Fetch-group with more than one head instruction
The mapping of instructions to the cache model in Figure 4.3B includes examples
of all three fetch scenarios listed above. All cache lines holding no head operation
represent case 1; the fetch groups containing h1 and h2 represent case 2; the fetch
group containing {h3, h4} represents case 3.
[Figure 4.3 diagram: (A) a control flow graph of basic-blocks B1–B5 with heads h1–h5 and control operations c1, c2, c4, c5; (B) the blocks' instructions, marked I, laid out across instruction-cache fetch groups.]
Figure 4.3: (A) A control flow graph with five blocks labeled B1 to B5. Each block has a head operation labeled h1 to h5. c1 to c5 correspond to control instruction members of code blocks. (B) Mapping of operations in the control flow graph onto an instruction cache. This figure assumes a 4-wide fetch unit with no fetch alignment support; instruction sequences in the same color shade would be fetched together. I corresponds to non-head and non-control instruction members of code blocks.
The Block PC Buffer can hold either a PC value or a 0x0 value. A PC value
prompts the fetch unit to start the next block fetch from the specified address. A
0x0 value indicates the next-block PC value is unknown. This situation happens
when a head operation, hh, is predicted not-taken. For the predictor to identify the
fall-through block PC (Equation 4.1), it needs access to the
fall-through-block-offset field of hh, which may not have been fetched yet. When the fetch stage
encounters this situation, it assumes the fall-through block is the block immediately
after the hh block in memory. So, it continues fetching the next block while the
fall-through block address for hh is computed. If the fetch unit completes the
next-block fetch before the fall-through block address is computed, fetch stalls until this
information becomes available. The next section describes the mechanism to predict
head PC's via an example.
4.2.2.1 Instruction Fetch Example
Assume the program control flow graph in Figure 4.3A and the 4-wide fetch cache
model in Figure 4.3B. Furthermore, assume a perfect branch predictor produces the
following sequence of dynamic code blocks to be fetched: {B1 → B3 → B4 → B1 → B2
→ B4}. For this fetch example, the sequence of predict, fetch, and decode activities
along with the Block PC Buffer contents are shown in Figure 4.4. This example
assumes the Block PC Buffer initially holds the PC of h1 from past predictions (cycle
0). The instruction cache starts by fetching B1 in cycles 1-2. In cycle 2, the fetch of
B1 completes via detecting h2. In cycle 3, fetch starts at the cache line where h3 is
located. At the end of cycle 3, the fetch of B3 completes via detecting h4. Since B3 is
predicted not-taken, the fetch unit continues fetching the B4 instructions even though
the fall-through destination address (i.e. the h4 PC) is still unknown. The address for h4
would be verified in cycle 4, when the block offset value for h3 is available. In cycle
8, the fetch of B4 completes via detecting h5. In cycle 9, B1 is fetched. Its fetch is
completed in cycle 10, though since B1 falls through to B2, the fetch unit continues its
fetch until it reaches cycle 15. In cycle 15, the fetch for B2 completes via detecting
h3. B4 is the block to be fetched after B2. Because h4 is located in the fetch-group
that was just fetched, the fetch unit continues its fetch for B4.
In cycle 1, h3 is predicted using the h1 PC. In cycle 2, h3 is predicted not-taken,
though its destination address remains unknown until after h3 is fetched. In cycle 4,
the fall-through address becomes available as h4, which replaces the 0x0 entry at the
head of the Block PC Buffer. h4 is used to predict h1 as the next block address. Since
h1 is predicted not-taken, its destination address would be unavailable until cycle 9, when
the fall-through offset of h1 is available after fetching h1. In cycle 10, h2 is computed,
which replaces the 0x0 entry at the head of the Block PC Buffer. In cycle 11, h2 is
CYCLE  BR PRED    FETCH          DECODE         Block PC Buffer
0                                               h1
1      h3         h1, I, I, I                   h3
2      not-taken  I, c1, h2, I   h1, I, I, I    h3, 0x0
3                 c2, h3, I, h4  I, c1, h2, I   0x0
4      h4         I, I, I, I     c2, h3, I, h4  h4
5      h1         I, I, I, I     I, I, I, I     h1
6      not-taken  I, I, I, I     I, I, I, I     h1, 0x0
7                 I, I, I, I     I, I, I, I     h1, 0x0
8                 I, I, c4, h5   I, I, I, I     h1, 0x0
9                 h1, I, I, I    I, I, c4, h5   0x0
10     h2         I, c1, h2, I   h1, I, I, I    h2
11     h4         I, I, I, I     I, c1, h2, I   h4
12     h1         I, I, I, I     I, I, I, I     h4, h1
13     h3         I, I, I, I     I, I, I, I     h4, h1, h3
14     not-taken  I, I, I, I     I, I, I, I     h4, h1, h3, 0x0
15                c2, h3, I, h4  I, I, I, I     h4, h1, h3, 0x0
16                I, I, I, I     c2, h3, I, h4  h1, h3, 0x0
17                I, I, I, I     I, I, I, I     h1, h3, 0x0
18                I, I, I, I     I, I, I, I     h1, h3, 0x0
Figure 4.4: An example code fetch sequence that follows the code example in Figure 4.3. The table shows the sequence of events in the BPU, Fetch, and Decode units over time, together with the contents of the Block PC Buffer.
popped from the Block PC Buffer to verify that the fall-through block being fetched is
valid, and to predict h4 as the next block to fetch. Following the above prediction
mechanism, blocks h1 and h3 are predicted in cycles 13 and 14, respectively. In cycle
14, h3 is predicted not-taken. The BPU then stops predicting more PC's until
the corresponding instance of h3 is fetched.
In summary, the above prediction mechanism predicts the next block using the
current block PC until the point where a block is predicted not-taken. At that point,
the BPU stalls until the fall-through block offset becomes available. In such a case,
the fetch unit continues fetching one more block (i.e. the fall-through block) while
waiting for the Next-Block PC to be computed.
If Inequality 4.3 holds, the fall-through destination address would be known
prior to the fall-through block fetch cycle.
BS ≥ FW × FPD    (4.3)
BS represents the Block Size, FW represents the Fetch Width, and FPD represents
the Fetch Pipeline Depth. When BS is so small that this relationship does not hold,
to avoid fetch stalls due to unknown fall-through addresses, the BPU pushes a 0x0
into the Block PC Buffer. This value notifies the fetch unit that it must continue its
fetch upon detecting a new block, assuming the next observed code block is the fall-through
value. When the fall-through offset value becomes available, the Next Block
PC generator verifies that the expected fall-through address matches the next fetched
block. The compiler also guarantees that fall-through code blocks always follow each other
in the binary. The adversarial case that can stall the fetch pipeline is when
several very small code blocks predict not-taken. Such block sequences, however, are
rare in practice, as shown in Chapter 6.
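As a concrete check on Inequality 4.3 (the fetch parameters below are assumed values for illustration, not the evaluated configuration):

```python
def fall_through_known_in_time(block_size, fetch_width, fetch_pipe_depth):
    # Inequality 4.3: BS >= FW * FPD. The block must cover at least as
    # many instructions as are in flight through the fetch pipeline for
    # the head's fall-through-block-offset to be decoded before the
    # fall-through block's fetch cycle.
    return block_size >= fetch_width * fetch_pipe_depth

# Assumed example: a 4-wide fetch unit with a 2-stage fetch pipeline.
assert fall_through_known_in_time(8, 4, 2)      # large block: no 0x0 needed
assert not fall_through_known_in_time(4, 4, 2)  # small block: BPU pushes 0x0
```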
4.2.2.2 Fetch Alignment
Figure 4.3B shows operation h3 as the second word in its fetch group. When c2
jumps to h4, the fetch unit could fetch instructions {c2, h3, I, h4}. In this case, the
Decode Stage would turn c2 into a NOP. Such operations are highlighted in red in
the Decode column of Figure 4.4. Alternatively, the fetch unit could start fetching
from h3, fetch an additional operation from the next fetch group, align h3 to be the
first operation in the fetched group, and send the four operations {h3, I, h4, I} to
the Decode stage. The Cg-OoO front-end performs the latter fetch alternative to
maximize the number of useful instructions fetched per cycle. Though, for the sake
of simplicity, the example in Figure 4.4 does not support such fetch alignment.
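A minimal sketch of the alignment step, treating the cache as a flat instruction stream (real hardware would merge words from two cache lines):

```python
def align_fetch_group(stream, head_index, width=4):
    """Start the fetch group at the head instruction and borrow words
    from the following line so all `width` slots carry useful work,
    instead of fetching line-aligned and NOP-ing the words before the
    head. Purely illustrative model."""
    return stream[head_index:head_index + width]

# Reproduces the text's example: starting at h3 yields {h3, I, h4, I}.
```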
4.2.2.3 Early head Detection & Forwarding
Figure 4.5 illustrates the head forwarding logic immediately after instruction cache
access. It compares every instruction opcode against the head opcode and if there
is a match, it sends the head’s fall-through-block-offset field to the BPU. By
transferring the head's content before the fetch group is decoded, this simple logic unit
minimizes the delay to generate the fall-through address. Notice that early head detection
assumes a fixed-size bytecode ISA. If an ISA generates a variable-size bytecode, the
fall-through-block-offset would be forwarded to the BPU at the end of the Decode
stage.
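The comparator array reduces to an opcode match per fetched word. In the sketch below, instructions are modeled as (opcode, immediate) pairs and "head" stands in for the real opcode encoding; both are assumptions for illustration:

```python
HEAD_OPCODE = "head"  # stand-in for the actual head opcode bits

def forward_fall_through_offsets(fetch_group):
    """Scan a just-fetched group and return the fall-through-block-offset
    immediates of any head operations, to be forwarded to the BPU before
    the group reaches Decode."""
    return [imm for opcode, imm in fetch_group if opcode == HEAD_OPCODE]
```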
[Figure 4.5 diagram: after the 4-wide instruction cache access, each fetched word's opcode is compared against the head opcode; matches forward their fall-through-block-offset toward the BPU before the group enters Instruction Decode.]
Figure 4.5: A simple logic unit used to detect head operations immediately after fetch. The fall-through-block-offset field of detected head operations would be forwarded to the BPU for Next Block PC computation.
4.2.3 Decode Stage
The instruction decode micro-architecture follows that of the conventional OoO
architecture model. One difference is that in the CG-OoO processor, this stage
distinguishes between global and local register operands by appending a 1-bit flag, called
the Register Rename Flag (RRF), next to each register identifier. For local operands
it sets the value to 0 and for global operands it sets it to 1. The Register Rename
stage looks up this bit to determine which operands must be renamed.
Figure 4.6 shows the RRF field.
As discussed earlier, in Figure 4.4, the operations marked red in the Decode column
are invalid and must be discarded from the execution flow. This stage identifies
invalid operations and turns them into NOP’s. For example, in cycle 16, operations
belonging to h3 would be discarded as B3 is not part of the execution sequence at
that time. While the fetch of some of these operations would be avoided via fetch
alignment, some still remain for the Decode Stage to handle.
4.2.3.1 Fetch vs. Decode Width in CG-OoO
As mentioned in Chapter 3, head operations are used to identify block boundaries
for the Block Allocator Stage. Beyond that stage, head operations have no utility.
As a result, once head reaches the Block Allocation Stage, it is discarded from the
execution pipeline.
To maintain equivalent front-end performance between the OoO and CG-OoO
processors, it is critical that they both dispatch the same number of operations on
average. Because the addition of head operations to the Instruction Set Architecture
(ISA) reduces the effective fetch bandwidth of the front-end, the analysis provided in
Chapter 6 assumes a 6-wide Fetch unit for the CG-OoO processor and a 4-wide Fetch
unit for the OoO processor. However, their Register Rename stages remain 4-wide to
maintain dispatch fairness between the OoO and CG-OoO processors.
4.2.4 Register Rename / Block Allocation Stage
4.2.4.1 Register Rename
As mentioned earlier, the Decode stage identifies and tags local / global register
operands for each instruction. In this stage, if an instruction holds a global register
operand, it accesses the register rename tables to receive its physical register. Oth-
erwise, the instruction would skip the register rename lookup. Figure 4.6 shows the
control logic for a single instruction operand choosing to access or bypass the
Register Rename stage. If the RRF is high, the register must be renamed; otherwise
its statically allocated value is used. Statically managed registers access a register
file structure called the Local Register File. Renamed operands access a register file
structure called the Global Register File. The register file hierarchy in this work is
further discussed in Section 4.2.7.1.
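The per-operand decision reduces to one bit. A sketch of the bypass follows; the tuple encoding and table shape are illustrative, not the hardware format:

```python
def resolve_operand(reg_id, rrf, rename_table):
    """RRF = 1: a global operand; look it up in the rename table and read
    the Global Register File. RRF = 0: a local operand; use the
    compiler-assigned identifier directly against the Block Window's
    Local Register File, skipping the rename lookup (and its energy)."""
    if rrf:
        return ("GRF", rename_table[reg_id])
    return ("LRF", reg_id)
```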
Operands that need to be renamed follow the conventional Merged Rename and
Architectural Register File renaming model in the Alpha 21264, MIPS R12000, and
Pentium IV processors [17, 22, 27]. The only difference is that updating the commit
Register Alias Table (commit-RAT) happens at block granularity, meaning all
global write operands belonging to a committing code block commit together. If
the number of write ports to the commit-RAT is smaller than the total number of
global write operands in a code block, commit of the code block may take multiple
cycles. To support block-commit, each entry in the Block Re-Order Buffer (BROB)
stores its global write register operands. More details on the BROB micro-architecture
are provided in Section 4.2.9.
Skipping the register rename stage reduces the renaming lookup energy by an
average of 30%. This feature is not possible in the OoO core as its execution model
enforces maintaining program data-flow dependency at instruction granularity rather
than block granularity.
4.2.4.2 Block Allocation
After the Decode Stage detects a head operation, the Block Allocator finds an available
Block Window (BW) to host upcoming block operations. If no BW is available to
be allocated, the processor front-end stalls. Figure 4.7 shows the state transition
diagram for the block manager. Each block is initially empty and Available. Once a
Block Window is selected by the allocator to store a code block, it transitions to the
Busy (Fetching) state, where it would continue to receive instructions until the last
instruction in the code block is fetched. As mentioned earlier, the end of a block is
detected via detecting the next head operation in the fetch sequence. Then, the BW
moves to the Busy (Done Fetch) state. It stays in that state until all instructions
[Figure 4.6 diagram: the RRF bit gates whether a register identifier (REG) passes through the Register Rename table to produce a physical register (P-REG) or bypasses renaming entirely.]
Figure 4.6: The register rename bypass logic. When the Register Rename Flag (RRF) bit is high, the register identifier must be renamed. Otherwise, it would skip renaming. The RRF is appended to each register identifier during the Decode Stage. The tri-state driver avoids register rename table lookups for local operands.
held in the BW are issued for execution.
The Block Allocation Stage maintains a FIFO queue of all BW’s in the Available
state. It maintains a register pointer to the BW whose instructions are currently being
fetched. This register will be looked up by the Instruction Steer unit (Section 4.2.5) to
transfer instructions to the appropriate BW destination. In addition, the Block Allocator
allocates an entry in the Block Re-Order Buffer (BROB).
Notice BW's move from Busy states to the Available state before all their
corresponding operations are completed. This implies the BROB size must be larger than
the total number of BW's in the processor to lower the chance of structural hazards
due to BROB-full events.
[Figure 4.7 diagram: Block Window states Available → Busy (Fetching) → Busy (Done Fetch) → Available, with a direct Available → Busy (Done Fetch) edge for blocks fetched in one cycle, and squash edges from both Busy states back to Available.]
Figure 4.7: State transition diagram of the Block Allocator unit. The status of each Block Window is tracked using this state transition diagram. Available BW's can be used to store upcoming code blocks from the processor front-end.
Whenever the code block size is smaller than the processor decode width, the
allocated block transitions directly from the Available to Busy (Done Fetch) state.
Upon a squash event, all BW’s holding instructions younger than the instruction
[Figure 4.8 diagram: a demultiplexer after the Decoder routes head operations to the Block Allocator and all other instructions to the Register Rename unit.]
Figure 4.8: The logic diagram routing head operations to the Block Allocator and all other instructions to the Register Rename unit.
initiating the squash must be flushed. All such blocks flush and reset their internal
context and move to the Available state. Section 4.3 discusses the BW context reset
process.
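The legal transitions of Figure 4.7 can be written down as a small table. The event names below paraphrase the text and are my own labels, not a hardware specification:

```python
# Block Window state machine, per Figure 4.7 (illustrative sketch).
TRANSITIONS = {
    ("AVAILABLE", "start_block_fetch"):    "BUSY_FETCHING",
    ("AVAILABLE", "block_fits_one_group"): "BUSY_DONE_FETCH",
    ("BUSY_FETCHING", "next_head_seen"):   "BUSY_DONE_FETCH",
    ("BUSY_FETCHING", "squash"):           "AVAILABLE",
    ("BUSY_DONE_FETCH", "all_issued"):     "AVAILABLE",
    ("BUSY_DONE_FETCH", "squash"):         "AVAILABLE",
}

def step(state, event):
    """Advance a BW's state; raises KeyError on an illegal transition."""
    return TRANSITIONS[(state, event)]
```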
To shorten the pipeline depth, the Block Allocation and Register Rename stages
are consolidated into a single pipeline stage, given that the types of instructions processed
by the two units are mutually exclusive. Figure 4.8 shows the interface between the
Decode and Register Rename / Block Allocation stages. All head operations in a
decode group are routed to the Block Allocation unit; the rest are routed to the
Register Rename unit. The demultiplexer select line has one bit for each operation
in the decode group.
Figure 4.9 shows the status signals (dotted arrows) traveling from the BW's and
BROB to the Block Allocator. The solid arrow from the Block Allocator to the
Instruction Steer Switch Network sets the destination BW(s) for upcoming instructions
from the rename stage. Also, memory operations allocate an entry in the LSU.
At the same time, the solid arrow from the Block Allocator to the BROB allocates a new
entry in the BROB for the new code block. When a BW completes issuing all its
instructions, it notifies the Block Allocator to switch its state to Available. When
the BROB runs out of empty slots, it notifies the Block Allocator to stop allocating new
blocks. When the Block Allocator receives a BROB-Full message or runs out of Available
BW's, it sends a stall message to the Fetch unit (see the dotted line from the Block
Allocator to Fetch). Likewise, when the LSU runs out of space, it notifies the Fetch
unit to stall.
[Figure 4.9 diagram: the front-end (Fetch, Decoder, Register Renamer, Block Allocator, Instruction Steer Switch Network) feeding BW 0 through BW n-1, with BW status signals plus BROB and LSU occupancy signals flowing into the Block Allocator, and a stall-fetch signal back to the Fetch unit.]
Figure 4.9: CG-OoO front-end highlighting the control signals used between the Block Allocator and other units to communicate the processor resource utilization status. LSU is the Load-Store Unit, BROB is the Block Re-Order Buffer, and BW is the Block Window.
4.2.5 Instruction Steer Stage
This stage consists of a point-to-point interconnection network that directs
instructions to their corresponding BW destination while maintaining the sequential order
between instructions within a fetch group. For fetch groups where instructions
belong to more than one BW (e.g. Figure 4.10D,E), the network steers instructions by
selecting the proper BW for each operation separately.
In parallel with steering instructions to their designated BW, the global write
operand identifier of each instruction is copied to its corresponding BROB entry
(i.e. the tail entry of BROB). These physical register identifiers are later used at
commit time to update their corresponding architectural register state. For more, see
Section 4.2.9.
4.2.6 Front-end Examples
Figure 4.10 shows several independent instruction sequences assuming all instructions
in the same line belong to the same fetch and decode group in a four-wide superscalar
CG-OoO processor front-end. The register operand identifiers starting with G refer
to global register operands (prior to register renaming) and the identifiers starting
with L refer to local register operands. In Figure 4.10A, all operations have global
read and/or write operands. These operands are renamed before they are dispatched
to their BW. Block Allocator detects the head operation in this group and assigns
an Available BW to the new code block. In Figure 4.10B, all operands in the fetch
group are local. As a result, the fetch-group can bypass the Register Rename / Block
Allocation stage and move directly to the Instruction Steer stage. In contrast, in
Figure 4.10C, although all operands are local and bypass the Register Rename stage,
because this fetch group initiates a new code block, instructions wait for the Block
Allocator unit to determine their destination BW. In Figure 4.10D, the first operation
on the left belongs to a di↵erent block than the rest of the fetch group. In this case,
lw L1, Addr bypasses Block Allocation / Register Rename stage and moves to its
destination BW. The rest of the fetch group stalls until the new available BW is
determined. Figure 4.10E shows the case where two heads appear in a fetch-group.
The Block Allocator stage is capable of allocating more than one BW per cycle.
Two Available BW’s are selected from the Available BW’s Queue. Each operation
is individually routed to its destination BW according to the BW ID set for it in
the Instruction Steer stage. Thus, instructions belonging to the two code blocks are
separated and routed into two BW's at the same time. Because the first code
(A) head; lw G1, Addr; add G1, G1, L1; sub L1, G1, #2
(B) lw L2, Addr; lw L1, L2; add L2, L1, L1; sub L2, L2, #2
(C) head; lw L1, Addr; add L2, L1, L1; sub L2, L2, #2
(D) lw L1, Addr; head; add L2, L1, L1; sub L2, L2, #2
(E) head; bne G1, G2, loop; head; add L2, L1, L1
Figure 4.10: Five example fetch and decode instruction groups showing different instruction combinations.
block is smaller than the decode group size, it completes its fetch in one cycle, and
its corresponding BW transitions directly from the Available to Busy (Done Fetch)
state (see Figure 4.7).
4.2.7 Issue Stage
Arguably, the scheduling stage is one of the most challenging units on the OoO
processor critical path, as several events must be handled within a small number of cycles.
In this section, I discuss a complexity-effective and energy-efficient instruction issue
(selection, wakeup) mechanism designed for the CG-OoO processor. Such a scheduler
must be fast and energy efficient in selecting ready operations. It must also support
a fast wakeup mechanism with low energy overhead in activating future ready
operations.
Before focusing on the instruction issue mechanisms, I describe the individual
micro-architectural components of the Issue stage.
4.2.7.1 Issue Micro-architecture
This stage consists of several Block Windows that buffer instructions for execution.
As shown in Figures 3.5 and 4.1, BW's receive operations via the Instruction Steer
Stage, and issue operations via the Instruction Scheduler Unit described later in this
section. Each BW consists of several key components; Figure 4.11 highlights these
components.
Instruction Storage Instruction Queue (IQ) is a FIFO queue that holds code
block instructions. Figure 4.12A shows the fields associated with each IQ table entry.
IQ has a finite size; the compiler splits the static code in case a code block holds
more instructions than this size. Splitting blocks increases block-level parallelism by
turning a large code stream into two separate code streams. However, it also increases
the Global Register File pressure. Thus, choosing the right block size is essential to
delivering high-performance computation while keeping the energy consumption low.
Chapter 6 discusses the energy-performance trade-off for different IQ sizes.
Head Buffer (HB) is a small buffer used to hold instructions waiting to be issued
by the Instruction Scheduler. The HB pulls instructions from the head of the IQ FIFO and
waits for their operands to become ready. Once an operation has all its operands
ready, the Instruction Scheduler removes it from the HB and issues it for execution.
As shown in Section 4.2.7.2, depending on its micro-architecture setup, the HB may hold
between one and four instructions. Figure 4.12B illustrates the logical structure of
each operation when stored in the HB; each source operand maintains a Ready bit,
indicating if its data is available, and a 64-bit field to hold its data. The data fields
would either be populated by a register file read or by the wakeup unit as discussed
in Section 4.2.7.1.
Register File Hierarchy Two types of register files exist in the CG-OoO processor:
a Global Register File (GRF), and several Local Register Files (LRF). The GRF is
managed by the Register Rename unit while LRF’s are managed by the compiler. The
GRF holds registers used to communicate data across BW’s. A LRF, on the other
hand, holds register operands used to communicate data between instructions within
the same code block. Each BW maintains a dedicated LRF to save intermediate data
used among its instructions.
[Figure 4.11 diagram: a Block Window containing the Instruction Queue, the Head Buffer, the LRF, and a GRF segment, connected to shared EU's.]
Figure 4.11: Components of a Block Window (BW): the Instruction Queue (IQ), the Head Buffer (HB), the Local Register File (LRF), and a Global Register File segment. EU's may be shared among multiple BW's.
[Figure 4.12 diagram: (A) an IQ entry holds Opcode, W-REG, R-REG1, R-REG2, and Immediate fields, each with a Valid bit; (B) an HB entry additionally holds, for each source operand, a Ready bit and a 64-bit data field.]
Figure 4.12: (A) Contents of each Instruction Queue (IQ) entry. The Valid bit specifies if the given field is used by the Opcode; when it is set to 0, the scheduler ignores the field. Valid bits are set by the Decoder. R-REG refers to a read register identifier and W-REG refers to a write register identifier. The identifiers can either be local or global. (B) Contents of each Head Buffer (HB) entry. When the Ready bits of all Valid source operands are set to 1, the operation is ready to be issued.
Issue Pipeline In conventional OoO processors, the select and wakeup units are
among the most energy-consuming hardware units [16]. As mentioned in Chapter 2,
the wakeup unit utilizes long wires to transfer data from producing execution units
to pending instructions in the IQ waiting for their operands. This unit also consumes
energy via accessing large Content Addressable Memory (CAM) arrays in the OoO
processor Instruction Window to wakeup operations dependent on recently generated
results.
Chapter 2 discussed two instruction issue design models where one model spends
an extra pipeline cycle to read results after issue and the other model saves the extra
cycle by reading and storing data in the Instruction Queue immediately after register
renaming. The CG-OoO design is a middle ground between the two models: it uses
limited CAM storage space to bring data from register files to instructions once they
are about to be issued. If not all read operand values are available at the register
file lookup time, instructions wait until their operands become available through the
wakeup mechanism.
The instruction issue pipeline stages are shown in Figure 4.13. Once instructions flow through the Register Rename and Instruction Steer stages, they are pushed to the IQ of their BW. Once the HB has an empty entry, the BW controller pops the instruction at the IQ head and pushes it to the end of the HB. At the same time, the instruction reads its source operands from the register file(s). If an operand finds valid data in its register file, the data is copied to the HB source operand data entry and its Ready bit is set. If the data is not yet available in the register file, the corresponding Ready bit remains zero until the wakeup unit forwards the data. When all source-operand Ready bits are set, the instruction may be selected by the Instruction Scheduler and driven to an available Execution Unit for execution. Once an operation is selected and popped from the Head Buffer, the instruction at the IQ head takes its spot by repeating the same steps described here.
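As a rough illustration, the dispatch-and-wakeup flow above can be sketched in Python. This is a much-simplified software model, not the hardware design; the class and function names (HeadBufferEntry, dispatch_to_hb, wakeup) are invented for this sketch:

```python
from collections import deque

class HeadBufferEntry:
    def __init__(self, opcode, src_regs, dst_reg):
        self.opcode = opcode
        self.src_regs = src_regs                 # source register identifiers
        self.data = [None] * len(src_regs)       # operand data fields
        self.ready = [False] * len(src_regs)     # per-operand Ready bits
        self.dst_reg = dst_reg

    def is_ready(self):
        return all(self.ready)

def dispatch_to_hb(iq, hb, hb_size, regfile):
    """Pop instructions from the IQ head into the HB while space remains,
    reading any source operands already available in the register file."""
    while iq and len(hb) < hb_size:
        opcode, src_regs, dst_reg = iq.popleft()
        entry = HeadBufferEntry(opcode, src_regs, dst_reg)
        for i, r in enumerate(src_regs):
            if r in regfile:                     # valid data at lookup time
                entry.data[i] = regfile[r]
                entry.ready[i] = True            # Ready bit set on RF hit
        hb.append(entry)

def wakeup(hb, reg, value):
    """Forward a newly produced result to pending HB operands."""
    for entry in hb:
        for i, r in enumerate(entry.src_regs):
            if r == reg and not entry.ready[i]:
                entry.data[i] = value
                entry.ready[i] = True
```

An instruction whose second operand misses at register-file lookup time simply waits in the HB until a later `wakeup` call supplies the value.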
The instruction selection stage follows the Oldest Ready Block arbitration protocol; it visits BW's from the oldest to the youngest dynamic code block to find ready operations. Once it finds enough ready operations to occupy all execution units, it stops issuing. The baseline OoO processor uses the Oldest
Figure 4.13: Pipeline cycles of the instruction allocation and instruction issue stages in the CG-OoO processor.
Ready Instruction arbitration protocol, as seen in many previous architectures including the Alpha 21264 [27]. My choice of the Oldest Ready Block arbitration protocol allows comparing, as much as possible, two execution models that attempt to prioritize instruction issue based on the program critical path. Discovering whether the Oldest Ready Block arbitration protocol is the optimal instruction issue model for the CG-OoO remains a future research topic.
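The Oldest Ready Block arbitration described above amounts to the following selection loop (a simplified Python sketch; the `block_sn` and `head_buffer` field names are illustrative, not part of the actual design):

```python
def select_oldest_ready_block(block_windows, num_eus):
    """Oldest Ready Block arbitration: visit Block Windows from the
    oldest to the youngest dynamic code block and collect ready Head
    Buffer operations until every execution unit is occupied."""
    issued = []
    for bw in sorted(block_windows, key=lambda b: b["block_sn"]):
        for op in bw["head_buffer"]:
            if op["ready"]:
                issued.append(op)
                if len(issued) == num_eus:
                    return issued          # all EUs busy; stop issuing
    return issued
```

Operations from the oldest block are always considered first, so younger blocks only issue when older blocks cannot fill all the execution units.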
Figure 4.14A shows an Instruction Queue holding nine operations belonging to a code block. The strike-through entries correspond to operations that have already been issued and executed. The IQ Head Pointer points at the upcoming instruction to be pushed to the Head Buffer shown in Figure 4.14B. Because the HB can hold up to three operations in this figure, operations {3, 4, 5} are the next set of instructions moving to the HB; their static fields, including Opcode, Immediate, Destination Register Identifier, and Valid Bit, move to the HB RAM array, and their Source Operand Identifiers and Valid Bits move to the CAM arrays; all CAM table Ready bits are initialized to zero. In the same cycle, source operands access their register file(s) to load their CAM array Data entries and set their Ready fields. If the value of a source operand is not available in a register file, the instruction waits in the Head Buffer until its data is computed. The wakeup unit is responsible for updating the source operands not found at register file lookup time.
Wakeup Pipelining To avoid stalls in the issue logic, wakeup events can be
pipelined as shown in Figure 4.15A. Upon selecting an operation, with predictable
Figure 4.14: (A) Instruction Queue holding nine operations. The strikethrough operations refer to instructions already issued and completed. The operations in green are at the head of the IQ and about to be dispatched to the HB. R refers to read operands and W refers to write operands. The dark gray fields are global source operands and the light gray ones are local. (B) A Head Buffer with three entries. For simplicity, this figure assumes a maximum of two source operands per operation, similar to the MIPS ISA. The CAM table stores data read from register files before issue and also supports associative search for the wakeup unit.
latency, one of two scenarios happens: (A) if the destination register of the selected instruction is local, it signals a wakeup message to its own BW CAM tables; (B) if the destination register is global, it broadcasts a wakeup message to all busy BW's. Instructions with dependent operands then update their corresponding Ready bits. If a dependent instruction has all the rest of its operands ready, it becomes ready for issue. If issued, the data corresponding to its dependent operand is forwarded to it just before the execution stage.

As shown in Figure 4.15B, the wakeup process is slightly different for operations that produce results with an unpredictable latency. Such an operation wakes up its consumers after it passes through the Execute stage. Notice, however, that the consumer instruction may be moved to the Head Buffer far in advance. Figure 4.15B shows an example in which a consumer is the instruction immediately after its producer in the same BW. Once the producer leaves the Head Buffer, the consumer replaces it while accessing the register file to read its source operands. However, the producer wakes up the consumer only after its execution completes. Assuming all the rest
Figure 4.15: (A) When instructions have predictable execution latency, wakeup events are pipelined and data forwarding is used to transfer intermediate results between operations. (B) When instruction latency is unpredictable, wakeup events update the source operands before issue.
of its source operands are ready, the consumer is marked Ready and selected for execution.
Issue Stage Energy Efficiency The issue model in the CG-OoO processor is a hybrid of the OoO Pre-Issue and Post-Issue models presented in Chapter 2. It reads register file data only for operations in the HB, thereby (A) avoiding the post-issue register file read cycle and (B) avoiding the pre-issue model's large data storage overhead by storing operand data only in the HB's; the number of operations in all HB's is a fraction of all in-flight instructions. As a result, this issue model is as fast as the OoO pre-issue model and more energy efficient than both. Its sources of energy efficiency are:
1. In each BW, the wakeup unit uses a small Head Buffer storage space to hold operand data. Operations stored in the Instruction Queue are not involved in the wakeup process.
2. The wakeup unit accesses small CAM tables to search for source operands. For instance, in a CG-OoO processor with 8 BW's, each with 3 HB entries, the wakeup unit accesses 48 CAM source operand entries. The OoO processor, however, searches a 128-entry Instruction Window for ready operands.
3. Local write operands wake up source operands associated with their own BW only. This limits the wakeup scope to the two CAM tables in the same BW.
In the pre-issue model, register file lookup happens at the register rename stage, making the number of required register file read ports twice the decode width. In the CG-OoO model, however, register lookup happens right before operations enter the HB structure, implying the following worst-case number of GRF read ports (Equation 4.4):
GRF_RdPortCount = 2 × HB_Size × BW_Count    (4.4)
By design, this number is larger than twice the decode width. For example, in a
4-wide OoO superscalar machine, the decode width is 4, making the required number
of read ports 8. To produce competitive performance, as shown in Chapter 6, the CG-OoO architecture is expected to have as many as 8 BW's, each holding 3 HB entries, making the number of required read ports 48. Clearly, such a large number of ports is not feasible in practice. To mitigate this problem, two design solutions
are considered. First, the presence of LRF’s reduces the GRF lookup pressure by
about 30%. Second, the physical register file is divided into multiple small segments,
each having its own dedicated ports. In the example provided here, if each BW
holds an eighth of the GRF and each segment has as many as 4 read ports, the total
number of GRF read ports would add up to 32. Figure 4.16 illustrates the micro-
architecture design to access 8 GRF segments; the three most significant bits of the
register identifier would be used to select the register segment and the remaining bits
would be used to read an entry within the selected GRF segment. Note that segmenting register files is also beneficial from an energy standpoint, as each register read or write consumes an eighth of the energy of a unified register file access.
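Under the assumption of an 8-bit global register identifier (the width here is purely illustrative), the segment-selection decode can be sketched as:

```python
NUM_SEGMENTS = 8
SEG_BITS = 3          # log2(8): the three most significant bits pick the segment

def grf_segment_access(reg_id, reg_id_width=8):
    """Split a global register identifier into (segment, offset): the
    three MSBs select one of eight GRF segments and the remaining bits
    index an entry within that segment."""
    offset_bits = reg_id_width - SEG_BITS
    segment = reg_id >> offset_bits           # demultiplexer select lines
    offset = reg_id & ((1 << offset_bits) - 1)  # index within the segment
    return segment, offset
```

For example, identifier 0b101_00110 decodes to segment 5, entry 6 of that segment.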
While register file segmentation enables energy efficient register file management, this technique is independent of the CG-OoO execution model and may be
Figure 4.16: GRF segment access demultiplexer to access one of eight GRF segments. The three most significant bits of the register identifier are used to select a segment and the rest are used to index into the GRF segment.
separately applied to existing architectures. The pros and cons of register file segmentation are discussed by Tseng et al. [54].
A GRF segment can become a hotspot when instructions from several BW's attempt to read it in the same cycle. In such cases, due to the structural hazard, the processor postpones dispatching some instructions from their IQ to the HB until a later cycle. Chapter 6 shows that the combination of hardware and software solutions used here eliminates the hotspot condition in the benchmark applications evaluated.
Operand Data Forwarding Data forwarding hides instruction issue stall cycles by providing source operands to the next instruction immediately after the data is generated. Figure 4.15A shows the effect of data forwarding in supporting the wakeup process. CG-OoO supports data forwarding by delivering operands between any two execution units. In Figure 4.15B, the forwarding unit stores the recently produced data into the HB entry of any woken-up consumer (as well as into the corresponding register file).
Figure 4.17 illustrates the interconnection network model between EU's belonging to different execution clusters. Forwarding between directly connected EU's takes one cycle. The communication latency between EU's more than one hop away from each other is expected to be two or more cycles. However, the evaluation results in Chapter 6 make the simplifying assumption that all forwarding communications
Figure 4.17: The interconnection network connecting EU clusters together. Here, four EU's serving their corresponding four BW's form a cluster; in each cluster, EU's are connected to each other via point-to-point links. Clusters are separated by their color and are connected together through a higher network fabric. Thinner wire connections (in blue) enable data forwarding between EU's.
take a single cycle.
4.2.7.2 Issue Scheduler
To leverage dynamic instruction execution in the CG-OoO processor, three complexity-effective instruction issue techniques are evaluated:
1. Single-Issue Scheduler
2. Multi-Issue Sequential Scheduler
3. Multi-Issue Skipahead Scheduler
This section describes the pros and cons of each issue technique in detail, and Chapter 6 evaluates the performance and energy behavior of each technique separately.
(A)                        (B)
1. sll r1, r2, #3          1. sll r1, r2, #3
2. lw  r0, r2              2. lw  r0, r1
3. add r3, r4, #1          3. add r3, r4, #1
4. bne r2, r5, LOOP        4. bne r2, r3, LOOP
Figure 4.18: (A) A code snippet where four consecutive operations have no register data dependency. (B) A code snippet with two data dependencies: operation 2 depends on the result of operation 1, and operation 4 depends on the result of operation 3. The dependency registers are highlighted in bold.
Single-Issue Scheduler The single-issue scheduler model refers to the case where each BW is allowed to issue one instruction per cycle. This is the most energy efficient CG-OoO configuration I evaluate in this work. In this model, each Head Buffer has one entry. It eliminates the need for CAM lookup during the wakeup process and requires the instruction selection logic to check the readiness of only one instruction per BW. However, this model may throttle the issue of useful work from a single BW. For instance, when two back-to-back operations are not data dependent, they could potentially issue in the same cycle, but this model does not permit issuing both at once.
Figure 4.18A illustrates an example code sequence where all operations are independent of each other. Figure 4.18B illustrates a code sequence with two data dependency links between operations 1, 2 and operations 3, 4. While the potential ILP of the code in Figure 4.18A is twice that of Figure 4.18B, this issue model delivers an ILP of at most 1 for both. That is, the single-issue model issues one instruction per cycle from the code sequence in Figure 4.18A even though all four operations may be ready to issue at the same time.
Multi-Issue Sequential Scheduler The multi-issue sequential scheduler generalizes the single-issue model by allowing more than one operation to be issued from each BW. In this model, each Head Buffer holds multiple entries and manages a multi-entry CAM table. The scheduler arbitrates through all blocks and chooses the BW with the oldest dynamic code block. It then issues multiple instructions from that queue in sequential order before issuing from younger BW's. As soon as all available EU's become busy, the arbiter refrains from issuing more operations from the remaining BW's.
This scheduling model allows in-order issue of multiple operations from the same
HB. For example, assuming a three-entry HB, the case in Figure 4.18A would allow
issuing operations {1, 2, 3} in the same cycle and {4} in the next cycle. The case
in Figure 4.18B allows issuing operations {1}, {2, 3}, {4} in three separate cycles.
Operation 2 is data-dependent on operation 1, but operation 3 is independent of both 1 and 2. However, because operation 2 cannot issue along with operation 1, it also prevents younger instructions from issuing, thereby delaying the issue of operation 3.
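The sequential issue window behavior can be sketched as follows (a simplified Python model; ready flags are assumed to be precomputed by the wakeup logic):

```python
def sequential_issue(hb_ops, free_eus):
    """Multi-issue sequential scheduling within one Head Buffer: issue
    from the HB head in program order, stopping at the first non-ready
    operation or when no execution unit remains."""
    issued = []
    for op in hb_ops:
        if not op["ready"] or len(issued) == free_eus:
            break                     # a stalled op blocks younger ops
        issued.append(op)
    return issued
```

This captures the limitation discussed above: a single stalled operation at the HB head prevents ready younger operations behind it from issuing.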
Multi-Issue Skipahead Scheduler The Skipahead model takes instruction scheduling further by allowing limited out-of-order issue of operations. Even though instructions within each BW may be issued out-of-order, the selection and wakeup energy overhead is no higher than that of the multi-issue sequential scheduler; however, the skipahead scheduler's average performance is 41% higher. Using this model, and assuming a three-entry HB, the operations in Figure 4.18B would be issued as {1, 3} followed by {2, 4}. This model requires data-dependence checking when skipping across operations.
For instance, in Figure 4.18B, before issuing operation 3, its operands must be checked for true and false dependencies against the operands of skipped operation 2. If no dependency exists, operation 3 is permitted to issue out-of-order. To architect a complexity-effective dependency checker, simple XOR logic is used to cross-reference younger operations against stalling older operations. Figure 4.19 illustrates the operand dependency logic. The dark green XOR gates check for write-after-read (WAR) dependencies and the light green gates check for read-after-write (RAW) dependencies. If any of the operand checks returns a non-zero value, skipahead issue is not allowed.
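A software analogue of this dependency check, with plain equality comparisons standing in for the XOR comparators, might look like the following sketch (the dictionary encoding of operations is invented for illustration):

```python
def can_skip_ahead(younger, skipped):
    """Cross-reference a younger operation against every stalled older
    operation it would bypass; any register match blocks out-of-order
    issue. Hardware realizes these comparisons with XOR gates."""
    for older in skipped:
        if older["dst"] in younger["srcs"]:   # RAW: true dependence
            return False
        if younger["dst"] in older["srcs"]:   # WAR: false dependence
            return False
        if younger["dst"] == older["dst"]:    # WAW: false dependence
            return False
    return True
```

Applied to Figure 4.18B, operation 3 (`add r3, r4, #1`) shares no registers with the stalled operation 2 (`lw r0, r1`), so it may skip ahead.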
Figure 4.19: Head Buffer logic required to support the Skipahead issue model. The dark green XOR gates are used to handle write-after-read (WAR) hazards, and the light green XOR gates are used to handle read-after-write (RAW) hazards. Write-after-write (WAW) hazards are automatically handled by these gates. If any XOR operation produces a non-zero value, the corresponding instruction is not permitted to skip ahead.
4.2.8 Memory Stage
The memory stage in this work consists of a Load-Store Unit (LSU) that allows speculative issue of load operations across stores. Unlike most units described in this chapter, the LSU operates at instruction granularity. To maintain memory access order, memory instructions are inserted into the LSU immediately after the register rename stage. Each memory operation stores its Block Sequence Number in its load/store queue entry for use at commit or squash time (see Section 4.2.9). If the LSU runs out of space to hold memory operations, it notifies the front-end to stall fetch.
Apart from the memory mis-speculation model, the rest of the CG-OoO LSU matches that of conventional OoO processors. The same holds true for the underlying cache hierarchy, the Miss Status Holding Registers (MSHR), and the memory disambiguation predictor.
Memory operations may exist in any position within the code block. Whenever a load operation mis-speculates due to a data conflict with an older store operation, it triggers a squash event that flushes all younger code blocks, including the code block within which the squash was triggered. For instance, if operation 2 in Figure 4.18A were to trigger a memory mis-speculation event, the entire block, including operation 1, would be squashed. Notice that because operation 1 is older than operation 2, it would not have been squashed in the OoO execution model. In this work, the squash of useful operations is called a wasted squash. To avoid wasting useful work, the compiler schedules load operations as early in the code block as possible so that they are the first instructions to be issued. Doing so also improves the memory-level parallelism of the program. Chapter 6 evaluates the impact of wasted operations on performance and energy.
4.2.9 Write-Back & Commit Stage
Figure 4.20 shows the contents of a BROB entry. It holds the Block Sequence Number (SN), the Block Size, and all Global Write (GW) register operands in the block. The Block Size field is initialized from the BlkSize field of the block head. The GW operands are inserted into a BROB entry one by one as instructions (with global write registers) move from the Register Rename stage towards their BW. This section discusses the utility of each field in the BROB.
4.2.9.1 Write-Back Stage
Once an instruction completes its operation, it writes its result into either a designated register file entry (global or local) or the store queue. If the destination operand is a local register, the instruction accesses its corresponding BW to update the LRF; this is done through a BW ID tag each operation carries through the execution pipeline.
Block SN | Block Size | GW0 | GW1 | GW2 | GW3 | GW4 | GW5 | GW6 | GW7 | GW8 | GW9
Figure 4.20: Each Block Re-Order Buffer (BROB) entry consists of a Block Sequence Number (SN) field, a Block Size field, and a number of Global Write operand fields. To prevent structural hazards, the compiler controls the number of permissible global write operands per code block.
Figure 4.20 shows that each BROB entry holds a Block Size field; this field tracks the number of completed operations for each in-flight code block. Upon each instruction completion, the corresponding Block Size entry in the BROB is decremented. Once the value of a Block Size field reaches zero, the corresponding block is complete.
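The Block Size bookkeeping can be sketched as a simple per-entry counter (the class and field names here are invented for this sketch):

```python
class BROBEntry:
    """A Block Re-Order Buffer entry tracking one in-flight code block."""
    def __init__(self, block_sn, block_size):
        self.block_sn = block_sn
        self.remaining = block_size   # initialized from the head's BlkSize field
        self.global_writes = []       # GW operands, filled in at rename time

    def on_instruction_complete(self):
        """Decrement the per-block counter; when it reaches zero the
        whole block is complete and may commit once it sits at the
        BROB head."""
        self.remaining -= 1
        return self.remaining == 0
```

Commit still waits for the entry to reach the head of the BROB FIFO, so blocks retire in program order even when younger blocks complete first.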
4.2.9.2 Commit Stage
A block is committed when it is completed and is at the head of the BROB FIFO.
Upon commit, all GRF operands modified by the block must be marked Architectural
in the commit-RAT. To do so, the committing block uses its GW fields to update the
renaming state of its physical registers to Architectural. If the number of commit-RAT
ports is smaller than the number of global register operands, the commit process can
take multiple cycles.
Upon a commit event, store operations belonging to the committing code block must retire. To do so, the "commit" pointer in the Store Queue moves to the youngest store operation belonging to the committing block, found by searching for the youngest store whose Block SN matches the committing Block SN. Recall that each Store Queue entry has a Block SN field.
4.3 Squash
The squash mechanism is a hardware solution that rolls the processor back to its most recent non-speculative state by undoing incorrect modifications to the program context due to control and memory speculation. Control mis-speculation happens when the BPU outputs incorrect control predictions, leading to the fetch and execution of wrong-path code. Memory mis-speculation happens when the Load-Store Unit makes false predictions about the latest value of a memory location, leading to incorrect data computations.
The squash mechanism influences multiple tables and logic units within the processor pipeline. Previous sections elaborated on some aspects of the squash process; in this section, all key squash events are described concretely. Upon a squash event, the following take place:
• The BPU history queue and the Block PC Buffer flush their content corresponding to the wrong-path predictions. The code block PC resets to the start of the right path. In case of a control mis-speculation, the right path is the opposite side of the control operation; in case of a memory mis-speculation, the right path is the start of the mis-speculated block.
• All BW’s holding younger code blocks than the mis-speculated operation flush
their IQ content, reset all Head Bu↵er entries to zero, and mark all their LRF
registers invalid.
• The Load-Store Unit flushes all operations belonging to code blocks younger than the mis-speculated operation. This is done by comparing the mis-speculated Block SN against the Block SN of each memory operation.
• The BROB entries corresponding to code blocks younger than the mis-speculated operation are flushed. The remaining blocks complete their execution and commit.
• When the BROB is completely empty, the commit-RAT holds the up-to-date architectural state of all global registers. At this stage, the commit-RAT updates the fetch-RAT. Then the program restarts fetch and continues execution as normal.
The general squash protocol for memory and control mis-speculation is the same. The only difference is that in memory mis-speculation the block triggering the mis-speculation is flushed along with all younger blocks, while in control mis-speculation it is not. This is because control operations define the end of a code block, so a control mis-speculation only influences blocks succeeding the control operation. Load operations, on the other hand, can exist anywhere within a code block, and whenever a memory mis-speculation happens, the load operation itself must be squashed.
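The Block-SN-based flush for both cases can be sketched as follows (a simplified model over plain lists; the real hardware flushes CAM and FIFO structures in place):

```python
def squash(lsq, brob, mis_sn, memory_misspec):
    """Flush all state belonging to blocks younger than the mis-speculated
    one. On a memory mis-speculation the triggering block is flushed as
    well; on a control mis-speculation it survives, since the control
    operation ends its block."""
    boundary = mis_sn if memory_misspec else mis_sn + 1
    lsq[:] = [op for op in lsq if op["block_sn"] < boundary]
    brob[:] = [blk for blk in brob if blk["block_sn"] < boundary]
```

The single `boundary` adjustment is the only point where the two squash flavors differ, mirroring the protocol described above.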
4.4 Chapter Summary
In this chapter, I discussed the architectural details associated with each pipeline stage and provided insight into how each unit delivers energy efficient, high-performance computing support to the execution model.
The key energy saving opportunities discussed in this chapter are:
• The selective branch prediction access model, where only head operations are
permitted to do BPU lookups
• The compiler support for local register operands that enables skipping the register renaming stage
• The complexity-effective Block Window design, where only the small number of instructions present in the Head Buffer participate in the instruction wakeup and selection process
• The segmentation of the Global Register File into smaller SRAM tables to read
and write data
• The use of Local Register Files to allow data transfer on short wires (contrary
to the GRF), and to allow accessing small SRAM tables to read and write data
• The use of a roughly 10× smaller re-order buffer to maintain runtime program order
• The compiler support for producing efficient instruction sequences for each code block via block-level list-scheduling
While the CG-OoO processor provides a unique framework to exploit all the above energy efficiency opportunities, it is important to note that these features are designed to remain as decoupled from each other, and from the execution model, as possible. For instance, the energy efficient branch prediction model in this work may be replaced with a conventional branch prediction model without disturbing the CG-OoO execution model. Likewise, the absence of local operands in a binary would not disturb the execution model. This is particularly valuable for supporting backward binary compatibility to run programs with binaries produced for existing OoO processors.
Chapter 5
Methodology
The evaluation setup for this work consists of four major software modules. This
chapter describes the details of each module (listed below) and lays out the foundation
for the evaluation analysis discussed in Chapter 6.
• A binary translator and compiler back-end
• A functional emulation engine, based on the Intel Pintool [31]
• A cycle-by-cycle timing simulator
• A cycle-by-cycle energy estimation framework
Figure 5.1a shows the code flow through the compilation framework. Part (a) shows the compiler back-end producing a .s file that is later used by the simulation framework in part (b). Part (b) shows three major sections: the Functional Emulator, the Timing Simulator, and the Energy Model. The Functional Emulator instruments the original x86 binary code while running it on the native processor. The dynamic instructions generated by the emulator are reformatted into an internal Instruction Set Architecture (ISA) and sent to the Timing Simulator. The Timing Simulator models the out-of-order (OoO), coarse-grain out-of-order (CG-OoO), and in-order (InO) processors. The Energy Model produces per-access energy values, which the timing simulator uses to compute the total energy consumption of the different hardware units throughout the processor pipelines.
5.1 Compiler
The compiler performs Local Register Allocation and Global Register Allocation as well as Static Block-Level List Scheduling for each program basic block. This means the output ISA differs from the gcc-generated x86 ISA. Chapter 4 discusses this ISA in detail.
Figure 5.2 shows the compiler code processing stages. The code binary is first generated using gcc (with the -O3 optimization flag). The corresponding assembly code is then parsed to reconstruct the program's intermediate representation (IR) control flow graph (CFG). The dominance frontier is constructed through the five sub-steps shown in the figure. The dominance frontier sets are used to rename register operands into static single assignment (SSA) form. Then, the liveness analysis unit finds the live range of each SSA operand by tracking register definition and use sets. The CFG is then used to perform block-level list-scheduling followed by register allocation and dead code elimination. The code generated for the CG-OoO processor has two register allocation phases: local allocation and global allocation. The code generated for the OoO and InO cores uses global register allocation only.
Block-level list-scheduling refers to performing list-scheduling on each individual basic block separately. This avoids speculatively hoisting operations across control boundaries; no speculative static instructions means no energy overhead for executing additional (speculative) instructions at runtime. The static scheduler does not perform speculative load-store reordering; thus, store-to-load dependency detection is not a concern.1 To identify the block critical path, the compiler assigns latency numbers to each operation; the longest instruction sequence is then marked as the block critical path. When the compiler finds more available instructions than available execution units, it selects instructions in order of their path latency (from highest to lowest).
1The only instructions with unpredictable latency are memory operations. The compiler assigns L1-cache access latency (i.e. 4 cycles) to all memory operations.
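The critical-path computation the scheduler relies on can be sketched as a longest-path pass over the block's dependence edges (a simplified model; the tuple encoding of instructions is invented for this sketch):

```python
def block_critical_path(block):
    """Longest remaining-latency path for each instruction in one basic
    block. block: list of (iid, latency, dep_iids) tuples in program
    order, with deps pointing at older instructions in the same block.
    The instruction with the largest value heads the block critical
    path, so it is scheduled with the highest priority."""
    succs = {iid: [] for iid, _, _ in block}
    for iid, _, deps in block:
        for d in deps:
            succs[d].append(iid)
    path = {}
    for iid, lat, _ in reversed(block):  # successors appear later in program order
        path[iid] = lat + max((path[s] for s in succs[iid]), default=0)
    return path
```

Ties between ready instructions are then broken by these path lengths, from highest to lowest latency.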
Figure 5.1: The evaluation framework in this work consists of three main components: the compiler back-end, the simulation infrastructure, and the energy model. The .s file holds instructions in the ISA format.
[Figure 5.2 flow: x86 Binary Parser → Intermediate Representation → Dominators Construction (Strict Dominators → Immediate Dominators → Dominator Tree → Dominance Frontier) → Static Single Assignment → Liveness Analysis → List Scheduling → Local Register Allocation → Global Register Allocation → Dead Code Elimination → Output Static Instructions (CG-OoO ISA)]
Figure 5.2: The compiler back-end diagram. The first stage parses the program assembly to construct the intermediate representation, and the final stage outputs the same code as a .s file used by the runtime simulation engine.
Figure 5.3: The software blocks constructing the Functional Emulator.
5.2 Functional Emulator
Figure 5.1b shows the Pintool-based functional emulator built to instrument x86 op-
erations. Figure 5.3 shows the di↵erent software components inside the Functional
Emulator. Each of these components are discussed here. This unit generates x86
instructions and dynamically reformats them according to the code information pro-
vided by the compiler (i.e. .s code file). In case of the CG-OoO execution model, the
static schedule of instructions in each basic-block is enforced by the one provided by
the compiler. The other two processors use the original code schedule produced by
gcc.
The functional emulator and the timing simulator are setup as a producer thread
and a consumer thread respectively. Once instructions are generated and reformatted
by the emulator, they are inserted into the instruction queue (i.e. Ins Que). The
timing simulator consumes instructions as they appear in the queue.
The simulation framework supports Simpoints [18] through the PinPoints [41] API. Alternatively, it supports a fast-forwarding mode in which the simulator skips instrumenting the initial few billion instructions before emulating several tens of millions of instructions. The results provided in Chapter 6 use the fast-forwarding mode; all evaluations are configured to fast-forward 2 billion instructions, warm up the simulator with 2 million instructions, and run energy and performance evaluations on the following 20 million dynamic instructions.
In this study, I use the 2bc-gskew branch predictor unit (BPU) proposed by Seznec et al. [48]. Branch prediction is done by the emulator (instead of the timing simulator), and predictor updates are applied immediately at prediction time. To faithfully model the effect of control mis-speculation, the emulator supports executing code from the wrong path. Wrong-path execution is enabled through the Pintool context and memory-address logging API. Figure 5.4 shows the state transition diagram for switching between the wrong-path and right-path states; it also shows the sequence of events at each state transition and highlights the key Pintool API calls at each step. At every transition between the right path and the wrong path, register context and memory data must be tracked. Register context information is tracked through the Pintool API (PIN_SetContextReg(), PIN_SaveContext()). The memory data, however, is maintained as a data log in the emulator analysis code. Memory logging is enabled only during the right-path state; at the start of every right-path state, the memory log from the previous right-path state is restored to undo any potential memory overwrite during the wrong-path state, and the log is then reset to record memory writes during the current right-path state.
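The right-path memory log described above can be sketched as follows. This is a hypothetical illustration, not the emulator's code: it records the value of every right-path memory write and replays the log on the next right-path entry so that locations overwritten on the wrong path are restored. `Memory` is a stand-in for process memory that the real emulator accesses via PIN_SafeCopy().

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for process memory (addr -> 64-bit word).
using Memory = std::unordered_map<uint64_t, uint64_t>;

class WrongPathMemLog {
    std::vector<std::pair<uint64_t, uint64_t>> log_;  // (addr, right-path value)
    bool on_right_path_ = true;
public:
    // Every memory write goes through here; only right-path writes are logged.
    void write(Memory& mem, uint64_t addr, uint64_t val) {
        if (on_right_path_) log_.emplace_back(addr, val);
        mem[addr] = val;
    }
    // Entering the wrong-path state disables logging.
    void enter_wrong_path() { on_right_path_ = false; }
    // Entering the right-path state replays the previous right-path log to
    // undo wrong-path overwrites, then resets the log for the new state.
    void enter_right_path(Memory& mem) {
        for (const auto& [addr, val] : log_) mem[addr] = val;
        log_.clear();
        on_right_path_ = true;
    }
};
```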
In practice, upon every branch mis-prediction event, some number of wrong-path instructions are fetched from the instruction cache; this number varies with several parameters including the processor width, the number of instructions in the pipeline, and the type of mis-speculated control instruction. In this study, as a design simplification, the number of wrong-path instructions on a control mis-speculation is a fixed emulator configuration parameter, set to 20 for the results reported in Chapter 6. About 20 wrong-path instructions are required to resolve a mis-speculation event given a
Right-path state: memory writes are logged (PIN_SafeCopy()).
Right-path → wrong-path context switch (on a control mis-speculation): disable memory-write logging; save the right-path context (PIN_SaveContext()); switch the instruction pointer register to the wrong path (PIN_SetContextReg()); start execution at the wrong path (PIN_ExecuteAt()).
Wrong-path → right-path context switch (on the 20-instruction limit or an exception): enable the memory-write log (PIN_SafeCopy()); restore the memory-write log (PIN_SafeCopy()); reset the memory-write logger; start execution at the right-path context (PIN_ExecuteAt()).
Figure 5.4: The wrong-path / right-path state transition diagram along with the sequence of events that take place at every transition.
4-wide processor with five pipeline stages between the Decode and Execute stages.
In most cases, the emulator succeeds at executing 20 wrong-path instructions, but in certain cases it is retracted by the Wrong-Path Exception Handling unit before reaching the threshold. For example, if the wrong-path code is the catch side of an exception-handling block, or if it accesses an unallocated memory address, the emulator may leave the wrong path early. Figure 5.5 shows the average size of wrong-path code sequences for the SPEC Int 2006 benchmarks; in most cases, the wrong-path unit reaches the expected number of instructions.
5.3 Timing Simulator
The timing simulator is a consumer thread created by the emulator thread. The
emulator runs ahead and inserts instrumented instructions into the Ins Que. When
Figure 5.5: The average number of operations on the wrong path for each processor model. To produce these results, the per-benchmark wrong-path averages for all SPEC Int 2006 benchmarks are averaged.
the number of dynamic instructions reaches a maximum threshold (in this study, 1,000 instructions), execution switches to the timing simulator through the Pin semaphore API so that it can consume the instructions in the Ins Que. The timing simulator thread then consumes the instructions in the Ins Que before transferring control back to the producer thread. The Ins Que is provisioned to hold an arbitrary number of instructions; here, it holds a thousand instructions because shorter queue sizes lead to high thread context-switching overhead, which ultimately impacts the simulation time.
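The producer/consumer handoff above can be sketched as follows. This is a minimal illustration, not the simulator's actual code: a C++ condition variable stands in for the Pin semaphore API, names are hypothetical, and a plain integer stands in for an instruction record.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>

// The producer (functional emulator) fills the Ins Que with a batch of
// instructions, then blocks; the consumer (timing simulator) drains the
// whole batch and hands control back.
struct InsQue {
    static constexpr std::size_t kBatch = 1000;  // batch threshold from the text
    std::deque<uint64_t> q;                      // instruction ids, for brevity
    std::mutex m;
    std::condition_variable cv;
    bool producer_turn = true;

    void produce(uint64_t ins) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return producer_turn; });
        q.push_back(ins);
        if (q.size() >= kBatch) {        // batch full: hand control to consumer
            producer_turn = false;
            cv.notify_all();
        }
    }

    // Drain the batch (timing simulation would happen here), then return
    // control to the producer. Returns the number of instructions consumed.
    std::size_t consume_batch() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !producer_turn; });
        std::size_t n = q.size();
        q.clear();
        producer_turn = true;
        cv.notify_all();
        return n;
    }
};
```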
The timing simulator consists of a cycle-by-cycle pipeline architecture for the
OoO, CG-OoO, and InO execution models as well as their cache memory hierarchy.
Tables 5.1 and 5.2 outline the system configurations used in this study. Figure 5.6
shows the pipeline structure simulated for each of the three processor models.
Each major pipeline stage (i.e. branch prediction, fetch, decode, rename, issue,
execute, commit) is a separate class object. Instruction objects flow through the
stages via instruction queuing class objects called port. The lifetime of instructions
in a port depends on the latency of the preceding stage. For example, if the Decode
stage is configured to take three cycles to complete decoding an operation, each
operation will only become available to the Issue class three cycles after it arrives at the Decode stage. In the meantime, the instruction stays in the Decode outgoing port. The port class can be configured to receive and deliver an arbitrary number of instructions per cycle; it is, however, expected to be as wide as the pipeline stages it connects.
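The behavior of the port objects can be sketched as follows; this is a minimal illustration with assumed names, not the simulator's actual class. An instruction deposited by a stage only becomes visible to the next stage after the producing stage's latency has elapsed, and at most `width` instructions are delivered per cycle.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

struct Port {
    struct Entry { uint64_t ins; uint64_t ready_cycle; };
    std::deque<Entry> q;
    unsigned latency;   // latency of the preceding stage, e.g. 3 for Decode
    unsigned width;     // max instructions delivered per cycle

    Port(unsigned lat, unsigned w) : latency(lat), width(w) {}

    // The producing stage deposits an instruction at cycle `now`.
    void push(uint64_t ins, uint64_t now) { q.push_back({ins, now + latency}); }

    // Deliver up to `width` instructions whose latency has elapsed by `now`.
    std::vector<uint64_t> pop_ready(uint64_t now) {
        std::vector<uint64_t> out;
        while (!q.empty() && out.size() < width && q.front().ready_cycle <= now) {
            out.push_back(q.front().ins);
            q.pop_front();
        }
        return out;
    }
};
```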
The simulator timing is tracked via an internal clock. At every cycle, all pipeline
stages, in reverse order, get a chance to process one cycle worth of instructions. For
instance, a 4-wide Execute stage would complete executing up to four add operations
in one cycle.
As mentioned earlier, control speculation happens in the producer thread. Upon every control operation, the BPU produces a prediction, which is then compared against the instrumented outcome from Pin. In case of a mis-prediction, the wrong-path unit changes the execution context to execute wrong-path instructions as described earlier; otherwise, the emulator resumes on the right path. In either case, emulated instructions are reformatted to the internal API provided by the compiler and inserted into the Ins Que. Mis-speculated operations and wrong-path operations are marked with a mis-predict flag so that, when they are processed by the timing simulator, they can be identified by the squash handling unit.
Figure 5.7 shows the state transition diagram for squashing instructions in the timing simulator. When the simulator detects a mis-speculated instruction, it moves from the Normal Execution state to the Flush Younger Ops state; as the state name suggests, all instructions younger than the mis-speculated operation are removed from the re-order buffer (ROB): wrong-path instructions are deleted, and the rest are pushed back to the Ins Que to wait until the simulator returns to the Normal Execution state. After all younger instructions are flushed from the ROB, the simulator moves to the Drain Older Ops state to commit all instructions older than the mis-speculated instruction. As expected, mis-speculated control operations are committed along with older operations, but mis-speculated memory operations are pushed back to the Ins Que for re-execution with correct, up-to-date data.2
2 Note that the squash handling mechanism discussed here concerns only the timing simulator, not the functional emulator.
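The Flush Younger Ops / Drain Older Ops sequence can be sketched as follows, assuming a simple vector-backed ROB; the names and structures are illustrative, not the simulator's. Younger wrong-path ops are discarded, younger right-path ops go back to the Ins Que (oldest first), and older ops plus the mis-speculated one remain in the ROB to be drained.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Op { uint64_t id; bool wrong_path; };

void squash(std::vector<Op>& rob, std::size_t mispec_idx,
            std::deque<Op>& ins_que) {
    // Flush Younger Ops: walk from the youngest down to the op just after
    // the mis-speculated one.
    for (std::size_t i = rob.size(); i > mispec_idx + 1; --i) {
        const Op& op = rob[i - 1];
        if (!op.wrong_path)
            ins_que.push_front(op);   // right-path op waits to re-enter the pipe
        // wrong-path ops are simply discarded
    }
    // Drain Older Ops: the mis-speculated op and everything older stay
    // in the ROB to be committed.
    rob.resize(mispec_idx + 1);
}
```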
Shared Parameters                          Sizes
ISA                                        x86
Technology Node                            22nm
System clock frequency                     2GHz
L1 Cache size, associativity, latency      32KB, 8, 4 cycles
L2 Cache size, associativity, latency      256KB, 8, 12 cycles
L3 Cache size, associativity, latency      4MB, 8, 40 cycles
Main memory latency                        100 cycles
Instruction Fetch Width                    1 to 8
Branch Predictor                           Hybrid
  - G-Share Size                           8Kb
  - Bimodal (BIM)                          8Kb
  - Meta                                   8Kb
  - Global Pattern Hist                    13b
BTB Number of Entries, associativity       4096, 8
BTB Tag Size                               16b
Table 5.1: System parameters shared between all core architectures
Figure 5.6: Processor pipeline stages modeled in the timing simulator.
In-Order Processor                         Sizes
Pipeline Depth                             7 cycles
Instruction Queue                          8 entries, FIFO
Register File                              70 entries (64-bit)
Execution Unit                             1-8 wide

Out-of-Order Processor
Pipeline Depth                             13 cycles
Instruction Queue                          128 entries, RAM/CAM
Register File                              256 entries (64-bit)
Execution Unit                             1-8 wide
Re-Order Buffer                            160 entries
Load/Store Queue                           64 LQ entries, 32 SQ entries, CAM

Coarse-Grain Out-of-Order Processor
Pipeline Depth                             13 cycles
Block Window (BW) Count                    3-18
Instruction Queue / BW                     10 entries, FIFO
Head Buffer / BW                           2-5 entries, RAM/CAM
Execution Unit / BW                        1-8 wide
Local Register File / BW                   20 entries (64-bit)
Global Register File (GRF)                 256 entries (64-bit)
GRF Segments                               1-18
Number of Clusters                         1-3 clusters
Block Re-Order Buffer                      16 entries
Load/Store Queue                           64 LQ entries, 32 SQ entries, CAM
Table 5.2: System parameters for each individual core
Figure 5.8 shows a simple dynamic code sequence example. Instructions 1 through 5 are already fetched and allocated in the re-order buffer (ROB), and instructions 6 through 8 are pending fetch in the Ins Que. Instruction 2 is a branch marked as mis-speculated by the BPU in the emulator, and instructions 3 and 4 are from the wrong path. When instruction 2 reaches the Execute stage, the processor detects that it was mis-speculated and initiates the squash sequence. During the Flush Younger Ops state, instructions 3 and 4 are deleted and instruction 5 is pushed back to the Ins Que. During the Drain Older Ops state, instructions 1 and 2 are completed and committed.
To handle precise exceptions, the processor is capable of issuing instructions in-order upon an exception. Once recovered from the exception, the processor resumes its normal execution model.
5.4 Energy Model
In this study, the energy model supports table-access, cache-access, wire, stage-register, and execution-unit energies. The energy model estimates per-access dynamic energy and per-cycle static energy consumption for each unit, and it also estimates the area of each unit. The remaining parts of the simulator, such as logic blocks and control units, are assumed to have similar energy costs in both the baseline OoO and the CG-OoO model, and to have a secondary effect on the overall system energy.
The measurement techniques used to estimate the energy of each of the above-mentioned units are described next. All techniques produce per-access dynamic and static energy numbers, which are imported to the timing simulator as configuration parameters (see Figure 5.1). The timing simulator then computes the total dynamic energy consumption of each hardware unit: upon each access to the unit, the simulator increments the energy consumption by the per-access dynamic energy provided by the energy model. The total static energy consumed by each unit is estimated at the end of the simulation, when the simulator multiplies the number of simulation cycles by the per-cycle leakage energy of that unit.
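This per-unit accounting scheme can be sketched as follows; the structure and field names are hypothetical, and the energy values used below are placeholders rather than the calibrated outputs of the energy model.

```cpp
#include <cstdint>

// Dynamic energy accumulates per access; static energy is the per-cycle
// leakage times the total simulated cycles, added once at the end of a run.
struct UnitEnergy {
    double per_access_dynamic_pj;  // from the SPICE-based energy model
    double per_cycle_leakage_pj;
    double dynamic_pj = 0.0;       // accumulated during simulation

    void access() { dynamic_pj += per_access_dynamic_pj; }

    double total_pj(uint64_t sim_cycles) const {
        return dynamic_pj +
               per_cycle_leakage_pj * static_cast<double>(sim_cycles);
    }
};
```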
Figure 5.7: The state transition diagram for the squash mechanism in the timing simulator. This unit handles both memory and control mis-speculation events.
Figure 5.8: A simple dynamic code sequence example showing operations in the Ins Que and the re-order buffer (ROB). Before the branch mis-speculation is detected, the branch and its corresponding wrong-path instructions are in the ROB. After the squash is detected and handled, the ROB is empty, the younger right-path operations are pushed back to the Ins Que, and the wrong-path instructions are eliminated from the sequence without committing.
Figure 5.9: The evaluation pipeline used to produce per-access energy numbers for different SRAM tables and flip-flop arrays using SPICE simulations. Additional steps, including area estimation, energy scaling for different port configurations, and energy modeling for cache structures, are done by the Models Script. The YAML configuration file holds the necessary table specification parameters for the energy model (see Figure 5.11). CAM: content-addressable memory; RAM: random-access memory; FF: flip-flop.
5.4.0.3 Energy Model Structure
The energy model in this work consists of three major units:
• an SRAM table and flip-flop array energy model,3
• execution unit energies based on the Synopsys Design Compiler, and
• a wire energy model based on HotSpot [23].
In this section, the details associated with each of these units are discussed, with a heavier focus on the in-house energy model built for flip-flop arrays and SRAM tables.
Random Access Memory (RAM) Tables: RAM tables are designed as standard SRAM units accessed through a decoder and read through sense amplifiers. Static and dynamic energy numbers are generated using SPICE. After the SPICE simulation results are produced, additional steps including area estimation, energy
3 This energy model was built in collaboration with Subhasis Das [14] at Stanford University.
scaling for different port configurations, and energy modeling for cache structures are done.
SRAM area is dependent on the decoder design. In this study, the decoder consists of a global decoder, a pre-decoder, and a local decoder. The area is computed through Equations 5.1, 5.2, and 5.3, where w_addr is the number of address bits, w_local is the number of bits decoded on each local decoder, w_out-word is the number of word bits read on each access, L is the technology node size, P_w and P_r are the numbers of write and read ports respectively, and λ_h and λ_w are the cell dimensions (height and width) in lambda. Layout parameters follow the ITRS [25] standard for the 22nm technology node.

H_SRAM = L × λ_h × (P_w + P_r) × 2^(w_addr − w_local)    (5.1)

W_SRAM = L × λ_w × (2 × P_w + P_r) × w_out-word × 2^(w_local)    (5.2)

A_SRAM = H_SRAM × W_SRAM    (5.3)
The per-access energy numbers for all tables and pipeline registers are produced by applying the pulse signal shown in Figure 5.10 to each SPICE circuit. Phase I captures the dynamic switching energy and Phase II captures the static leakage energy. The dynamic and static energy generated in this step are measured for accessing one word (i.e. 64 bits) of an SRAM table. These numbers are then post-processed to produce final per-access estimates for different table configurations; configuration parameters, including the number of ports, the number of table entries, the row/column dimensions, and the word width, are shown in Figure 5.11.
Figure 5.12a shows the normalized energy per access for different register file sizes, and Figure 5.12b shows the corresponding area scaling for the same tables. Additionally, Figures 5.13a and 5.13b show the normalized energy per access and area of a 256-entry SRAM table with three different read/write port configurations commonly used in this study. The energy per access for a 256-entry SRAM table with 64-bit words and 8 read and 4 write ports is 7.6pJ, and its area is 1.53mm2. The SRAM
Figure 5.10: The measurement signal used in the SPICE models to measure dynamic and static energy per access.
layout parameters follow the ITRS [25] standard for the 22nm technology node.
Content Addressable Memory (CAM) Tables: CAM tables are designed as standard SRAM units accessed through a driver input module and read through sense amplifiers. Energy estimates for CAM tables are generated through SPICE in the same way as for RAM tables, and the same set of equations from the previous section is used to estimate the area of CAM arrays.
Figure 5.14 illustrates the CAM and RAM array models used in this study. Note that RAM and CAM tables operate in opposite ways: for the RAM table, the decoder drives the inputs to the table through the wordlines, whereas for the CAM table it is the searchline that is driven to trigger a table read. In other words, the wordline and searchline drive inputs while the bitline and matchline drive outputs.
To estimate the CAM access energy based on the RAM models, Equations 5.4 and 5.5 are used; bl, wl, ml, and sl refer to the RAM bit-line, RAM word-line, CAM match-line, and CAM search-line respectively, and n refers to the number of input bits to the CAM table (i.e. 64 bits). Equation 5.5 assumes that, on average, half of the
## Array Size Configurations
array:
  word_width: 64   # output width in bits
  num_words: 256   # number of words, each word is word_width wide
  cin: 12          # max input capacitance in terms of lambda

arrays:
  num_array: 1     # number of tables

## SRAM Cell Model Parameters
## lambda = min feature size = 0.5 * technology node
cell_model:
  rd_port: 2       # number of read ports (must be 1 at least)
  wr_port: 2       # number of write ports
  cin: 8           # wordline transistor input capacitance in lambda
  w: 16            # width (along wordline) in lambda
  h: 40            # height (along bitline) in lambda

## Local Wire Parameters
wire_model:
  pitch: 0.14      # width + spacing in um
  width: 0.07      # spacing in um
  height: 0.26     # from next lower layer in um
  thickness: 0.125 # vertical dimension thickness in um

## Global Wire Parameters
global_wire_model:
  eperl: 0.08      # energy/bit/mm (pJ/mm), assuming activity factor of 0.25
  tperl: 0.3       # delay per mm in ns/mm
Figure 5.11: An example YAML configuration file for the energy model, showing SRAM cell parameters, wire parameters, and table size configurations for a 256-entry register file with 2 read and 2 write ports.
Figure 5.12: Register file energy and area scaling as the number of entries grows from 64 to 256. The energy-per-access and area figures are normalized to the 256-entry register file. The energy per access for a 256-entry SRAM table with 64-bit words and 8 read and 4 write ports is 7.6pJ, and its area is 1.53mm2. The SRAM layout parameters follow the ITRS [25] standard for the 22nm technology node.
Figure 5.13: Register file energy and area scaling across the three read/write port configurations (2R2W, 4R4W, and 8R4W). The energy-per-access and area figures are normalized to the 8R4W register file. The energy per access for a 256-entry SRAM table with 64-bit words and 8 read and 4 write ports is 7.6pJ, and its area is 1.53mm2. The SRAM layout parameters follow the ITRS [25] standard for the 22nm technology node.
Figure 5.14: (a) A random-access memory (RAM) table; the corresponding cell model is shown in (b). (c) A content-addressable memory (CAM) table; the corresponding cell model is shown in (d).
.subckt nand2 a b y vdd_l
mp1 y a vdd_l vdd_l pmos L=1 W='Wp'
mp2 y b vdd_l vdd_l pmos L=1 W='Wp'
mn1 y a n1 0 nmos L=1 W='Wn'
mn2 n1 b 0 0 nmos L=1 W='Wn'
.ends

.subckt nand3 a b c y vdd_l
mp1 y a vdd_l vdd_l pmos L=1 W='Wp'
mp2 y b vdd_l vdd_l pmos L=1 W='Wp'
mp3 y c vdd_l vdd_l pmos L=1 W='Wp'
mn1 y a n1 0 nmos L=1 W='Wn3'
mn2 n1 b n2 0 nmos L=1 W='Wn3'
mn3 n2 c 0 0 nmos L=1 W='Wn3'
.ends

** Library name: ff
** Cell name: ff
.subckt flfp data clk q vddg
xnand1 o4 o2 o1 vddg nand2
xnand2 o1 clk o2 vddg nand2
xnand3 o2 o4 clk o3 vddg nand3
xnand4 o3 data o4 vddg nand2
xnand5 o2 o6 q vddg nand2
xnand6 o5 o3 o6 vddg nand2
.ends

** 64 FF's modeled here
xa d clk q vddgl flfp M=64
Figure 5.15: The 6-NAND-gate positive edge-triggered flip-flop (FF) circuit and its corresponding SPICE code. In this example, 64 FF's are modeled.
CAM input bits toggle on each access, hence the n/2 coefficient.
E_bl = E_ml    (5.4)

E_sl = (n/2) × E_wl    (5.5)
Stage Registers: A SPICE model for a 6-NAND-gate positive edge-triggered flip-flop (FF), shown in Figure 5.15, is used to evaluate the energy and area of pipeline stage registers. The figure also shows the corresponding SPICE code to simulate a 64-bit FF. The corresponding per-access dynamic and static energy numbers are measured via the pulse measurement analysis shown in Figure 5.10. To account for the activity factor of each pipeline stage entry, the dynamic energy calculation for all stage registers assumes half of the FF transistors toggle on each access.
Execution Units (EU): Different 64-bit execution units, including the add, multiply, and divide units for arithmetic and floating-point operations, are developed in Verilog and simulated in the Design Compiler, which provides per-operation energy numbers for each unit. As in the previous cases, the simulator uses this information as unit-energy values to compute the contribution of the EU's to the overall processor energy.
Wire Energy: HotSpot [23] is used for chip floorplanning and wire-length optimization. To configure HotSpot, two pieces of information are required: (a) the connectivity between different processor units, in terms of the number of bits communicated between them, and (b) the average energy consumed by each hardware unit in the processor pipeline. The connectivity information for the CG-OoO, OoO, and InO processors is provided to HotSpot according to the pipeline configurations shown in Figure 5.6, and the average energy numbers are collected through the SPICE and Design Compiler simulations previously described. HotSpot uses these parameters to find the optimal floorplan arrangement for each processor; it optimizes the floorplan based on three key parameters: area, temperature, and wire length. In this study, HotSpot is configured to prioritize area over wire length over temperature. To extract wire energy numbers from HotSpot, the software is extended to convert wire-length data to wire-energy values. The energy per access used for wires is 0.08 pJ/b-mm at the 22nm technology node. Once wire energy data is available for all major wires in the processor floorplan, the numbers are ported to the simulator. As in the previous cases, these are per-access energy numbers, so each time the simulator drives a particular wire, the energy consumed over that wire is accumulated. For instance, upon accessing a 64-bit wire, the total wire energy is incremented by half of the per-access wire energy; this is because I assume that, on average, half of the wire bits toggle on each access. The total wire energy consumption numbers are reported at the end of each simulation.
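The wire-energy accounting reduces to a one-line computation; a sketch assuming the 0.08 pJ/b-mm figure above and the half-toggle assumption (the function name is hypothetical).

```cpp
// Energy charged per access to a wire: energy-per-bit-mm times the wire
// length times the bit width, halved because on average half of the wire
// bits are assumed to toggle on each access.
double wire_access_energy_pj(double length_mm, int bits,
                             double e_per_bit_mm_pj = 0.08) {
    return 0.5 * e_per_bit_mm_pj * length_mm * bits;
}
```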
5.5 Chapter Summary
This chapter described the evaluation framework used to study the CG-OoO, OoO, and InO execution models. It provided detailed discussions of the design of the in-house compiler framework, the functional emulator built on top of the Pintool [31] API, and the in-house timing simulator. It also presented the energy model used to perform the energy studies in this work; some aspects of the energy model are built on top of existing software technologies, such as the Design Compiler and HotSpot [23], while other aspects are designed in-house.
Chapter 6
Coarse-Grain Out-of-Order Evaluation
The CG-OoO processor reaches the performance of the OoO processor at 52% of
its total energy cost. This chapter quantifies the energy benefits of the CG-OoO
processor and the pipeline stages that contribute to its superior energy profile. It
also evaluates the sources of performance gain in detail.
6.1 Sources of Energy Cost
Figure 6.1 shows the energy overhead of the OoO processor compared to the InO processor. Half of the energy overhead is in the table accesses associated with dynamic instruction scheduling; the remaining half comes from tables that are either expensive to access due to their large size or accessed too frequently. The register file and re-order buffer are examples of large, expensive-to-access tables; the register rename tables and branch prediction unit tables are examples of tables accessed more frequently than necessary.
Figure 6.2a depicts the pipeline stages of a standard out-of-order (OoO) processor, and Figure 6.2b shows its energy breakdown. The pipeline stages are color-coded to match their corresponding slices of the energy pie. This figure suggests the OoO energy consumption is fairly evenly distributed across all pipeline stages. As a result,
CHAPTER 6. COARSE-GRAIN OUT-OF-ORDER EVALUATION 86
Figure 6.1: Harmonic mean energy per cycle (EPC) of the SPEC Int 2006 benchmarks run on the in-order and out-of-order processors. The OoO energy overhead is divided into three major categories.
an energy-efficient architecture alternative must enable energy savings across nearly all pipeline stages. The CG-OoO execution model enables energy-saving opportunities throughout all pipeline stages, leading to 48% less average energy overhead compared to the OoO processor at the same performance level.
6.2 CG-OoO Design Characterization
At a high level, the CG-OoO processor di↵ers from the OoO processor in e�ciently
storing in-flight operations and e�ciently accessing register file data. To do so, it
utilizes a number of tables; namely the BW FIFO queues, BW Head Bu↵ers, BW
LRF’s, and BROB. In this section, I discuss the approach in designing each table
size.
Figure 6.3a shows the average number of in-flight dynamic code blocks; on average, four code blocks are in flight. Figure 6.3b shows that the average number of instructions per code block is ten. These figures suggest the average number of in-flight operations varies between 30 and 185 depending on the benchmark characteristics; the average over all SPEC Int 2006 benchmarks is 40 in-flight instructions. The CG-OoO
Figure 6.2: The energy breakdown of the OoO processor mapped to its pipeline stages. The pie chart shows the energy of the execution stage is about 1% of the total energy; it encapsulates the energy of the execution pipeline stage, wires, and EU's. The chart also shows the OoO energy breakdown is well distributed across all pipeline stages.
architecture evaluated in this chapter is designed to support up to 9 block windows, each holding up to 15 dynamic instructions. When a code block has more than 15 instructions, the front-end stalls until earlier instructions in the corresponding BW are issued.1
Each Head Buffer holds up to four operations. My experiments show that, upon a head-of-queue stall, an independent instruction that can hide the stall latency often exists within a distance of three operations in the same BW. Thus, as shown in Section 6.3, a Head Buffer with 4 instruction entries delivers enough instruction-level parallelism (ILP) to reach the performance of the OoO processor.2
The number of entries in each LRF is set to 20, which turns out to be sufficient to avoid register spilling in the SPEC Int 2006 benchmarks. Reducing register spilling
1 Note that dynamic splitting of a large code block is not permissible because it may lead to violating the local register communication assumptions made by the compiler.
2 9 BW's, each with a 15-entry instruction queue FIFO and a 4-entry head buffer, can hold up to 171 in-flight instructions.
Figure 6.3: (a) The average number of in-flight code blocks in the CG-OoO for SPEC Int 2006. (b) The average number of operations per dynamic code block.
reduces the need for additional MOV operations, which in turn saves energy.
The number of BROB entries is set to 16. As mentioned in Chapter 4, BW's become available as soon as their last instruction is issued for execution; thus, at runtime, the number of in-flight code blocks can be larger than the number of processor BW's. In this chapter, all evaluation results use a 16-entry BROB to avoid structural hazards due to the BROB size.
6.3 CG-OoO Performance Analysis
As discussed in Chapter 4, in the CG-OoO processor, instruction-level parallelism (ILP) is extracted from multiple sources, namely static block-level list scheduling, dynamic block-level parallelism (BLP), and limited dynamic instruction-level parallelism.
Although all processors presented in this evaluation are 4-wide superscalar machines, the CG-OoO model supports 12 EU's spread across 3 clusters; in other words, the CG-OoO front-end is 4-wide, but the total number of execution units is 12. This higher availability of computation resources allows more instruction-level parallelism to be exploited, as can be seen for the Hmmer, Bzip2, and Libquantum benchmarks in Figure 6.4.
Figure 6.5 illustrates the case where the CG-OoO is entirely 4-wide; that is, the
Figure 6.4: Performance of the CG-OoO and InO processors normalized to the per-formance of a 4-wide OoO processor. Here, the performance is measured in terms ofinstructions per cycle (IPC). All processors are configured to have a 4-wide front-end.
front-end is 4-wide and the number EU’s is also 4 (on a single cluster). In this case,
the CG-OoO performance is throttled to 7% less than the 4-wide OoO processor
baseline.
The first source of performance gain is static block-level list scheduling. Figure 6.6 shows the effect of static scheduling on performance. On average, static scheduling increases the CG-OoO performance by 14%. In the case of Hmmer, 19% more memory level parallelism is observed with the original binary schedule generated using gcc (with the -O3 optimization flag) than with the schedule generated using the block-level list scheduling compiler pass. The higher memory level parallelism is due to a superior global code schedule, which in turn leads to fewer stall cycles due to better inter-BW data communication through the GRF. In both cases, Hmmer performs better than the OoO baseline model.
The next source of performance gain is through block level parallelism3. To illus-
trate the contribution of block level parallelism, let us assume each BW can issue up
to four operations in-order; that is, if an instruction at the head of a BW queue is
not ready to issue, younger, independent operations in the same queue do not issue.
Other BW’s, however, can issue ready operations to hide the latency of the stalling
3Block level parallelism is defined in Chapter 3.
Figure 6.5: Performance of the CG-OoO and InO processors normalized to the performance of a 4-wide OoO processor. Here, the performance is measured in terms of instructions per cycle (IPC). All processors are configured to have a 4-wide front-end and back-end.
Figure 6.6: Effect of static block-level list scheduling on CG-OoO performance. On average, the CG-OoO is 14% faster with static scheduling. In all cases, the Skipahead 4 dynamic scheduling model is used.
Figure 6.7: The effect of the Skipahead model on performance. Without Skipahead, only 17% of the gap between OoO and InO is closed. With Skipahead 2, another 67% of the gap is closed, and with Skipahead 4, the entire gap is closed. All CG-OoO results use the statically list scheduled code.
BW. The No Skipahead bar in Figure 6.7 refers to this setup. It shows that, on average, 17% of the performance gap between the InO and OoO is closed through BLP. Benchmarks like H264ref and Sjeng exhibit better performance for the InO model. This is because the InO processor has a shallower pipeline depth (7 cycles) compared to the CG-OoO processor (13 cycles), allowing faster control mis-speculation recovery.
The last source of performance improvement in the CG-OoO is the limited out-of-order instruction scheduling within each BW; this feature is enabled through the Head Buffer tables. Figure 6.7 shows the performance gain obtained by varying the number of HB entries. Skipahead 2 refers to a HB with two entries; Skipahead 2 closes an additional 67% of the performance gap between InO and OoO. The 4-entry HB model (i.e., Skipahead 4) closes the rest of the performance gap. No significant performance difference is observed for larger Head Buffer sizes.
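To make the Skipahead mechanism concrete, the following sketch models the per-cycle select logic: each BW exposes only the first few entries of its queue (the Head Buffer), so a stalled head no longer blocks younger, independent operations in the same block. This is purely illustrative; the function and field names are not from the dissertation's implementation.

```python
def skipahead_issue(bw_queues, ready, hb_size=4, issue_width=4):
    """Illustrative sketch of Skipahead select, not RTL.

    bw_queues: per-BW instruction queues in program order.
    ready:     caller-supplied predicate on an operation.
    Only the first hb_size entries of each queue (the Head Buffer)
    are searched, so ready operations may issue past a stalled head
    while the hardware still scans only a tiny window per BW.
    """
    issued = []
    for queue in bw_queues:               # block-level parallelism
        for op in list(queue[:hb_size]):  # limited OoO within a BW
            if len(issued) == issue_width:
                return issued
            if ready(op):
                issued.append(op)
                queue.remove(op)
    return issued
```

With `hb_size=1` this degenerates to per-BW in-order issue (the "No Skipahead" configuration); larger Head Buffers let a block issue around a stalled head operation.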
Figure 6.8 shows the performance characteristics of CG-OoO when varying the
processor front-end width from 1 to 8. Comparing the harmonic mean results for
the OoO and CG-OoO shows the CG-OoO processor is superior on narrower designs.
A wider front-end delivers more dynamic operations to the back-end. Because the OoO model has access to all in-flight operations, it can exploit a larger effective instruction window. Despite the larger number of in-flight operations, the CG-OoO model maintains a limited view of the in-flight operations, making an 8-wide CG-OoO machine not much superior to its 4-wide counterpart.

Figure 6.8: The effect of the front-end width on the speedup of the CG-OoO and OoO. 1, 2, 4, and 8-wide processor models are evaluated here. The results indicate that the OoO performance scales better with a wider front-end.
6.4 CG-OoO Energy Analysis
The CG-OoO execution model augments the dynamic out-of-order execution model with an energy-efficient design solution. In doing so, it improves the energy efficiency of most pipeline stages. Overall, the CG-OoO shows an average 48% energy reduction for the SPEC Int 2006 benchmarks. Figure 6.9a shows the overall energy level for the CG-OoO, OoO, and InO processors; Figure 6.9b shows the harmonic mean energy breakdown for different pipeline stages; all benchmarks follow a similar energy breakdown trend as the harmonic mean. This figure shows the main energy savings are in the Branch Prediction, Register Rename, Issue, Register File access, and Commit stages. In this section, the source of energy saving within each stage is discussed.
Given that the main contribution of this study is building an energy-efficient processor core, Figure 6.9c excludes cache and memory system energy. Excluding the cache and memory system, the CG-OoO results in a 61% average energy saving versus an OoO processor with similar performance. All processor evaluations in this work use the same cache model.
Figure 6.10 shows the inverse of the energy-delay (ED) product, indicating the favorable energy-delay characteristics of the CG-OoO over the OoO for all benchmarks, even those that fall short of the OoO performance such as Sjeng and Gobmk. The average of the inverse of the ED product is 1.9.
Figure 6.11 shows the static and dynamic energy breakdown for different benchmarks relative to the OoO baseline. On average, the leakage energy is smaller than 4% of the total energy.
Next, let us focus on the energy saving characteristics of each of the CG-OoO
pipeline stages.
6.4.1 Block Level Branch Prediction
Block-level branch prediction is primarily focused on saving energy by accessing the branch prediction unit at block-level granularity rather than fetch-group-level granularity. For a benchmark application with an average block size of eight running on a 4-wide processor, this translates to roughly a 2x reduction in the number of accesses to the branch prediction tables. Figure 6.3b shows the average block sizes for the SPEC Int 2006 benchmarks. Figure 6.12 shows the relative energy-per-cycle for the CG-OoO model compared to the OoO baseline. On average, Block Level Branch Prediction is 53% more energy efficient than the OoO model. Hmmer shows an 83% reduction in branch prediction energy because of its larger average code block size (see Figure 6.3b).
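The roughly 2x access-count reduction can be checked with a toy counting model. The function below and its defaults are illustrative assumptions, not the dissertation's simulator:

```python
import math

def bp_accesses(n_instrs, fetch_width=4, block_size=8):
    """Toy model: a conventional front-end looks up the branch
    predictor once per fetch group, while block-level prediction
    looks it up once per code block."""
    per_fetch_group = math.ceil(n_instrs / fetch_width)
    per_block = math.ceil(n_instrs / block_size)
    return per_fetch_group, per_block

# 4-wide fetch, average block size of eight instructions
fg, blk = bp_accesses(10_000)  # fg = 2500 lookups, blk = 1250 lookups
```

Under these parameters the block-level scheme performs half as many predictor table accesses, matching the rough 2x figure quoted above.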
6.4.2 Register File Hierarchy
The CG-OoO register file hierarchy contributes to the processor energy savings in four different ways, each of which is discussed here.

• Local Register Files are low energy tables
• Register Rename Bypass is enabled for local operands
• Segmented Global Register Files reduce access energy
• Register renaming optimizations reduce on-chip data movement

Figure 6.9: CG-OoO, OoO, and InO normalized energy per cycle (EPC) results. (a) Normalized EPC of all processors (including the cache); on average, the CG-OoO is 48% more energy efficient than the OoO. (b) Harmonic mean EPC breakdown of all processor models; this panel highlights five major sources of energy saving in the CG-OoO processor pipeline, each of which is discussed in detail in this chapter. (c) Normalized EPC of the core excluding the cache; on average, the CG-OoO core is 61% more energy efficient than the OoO.

Figure 6.10: The inverse of the energy-delay product for the CG-OoO design normalized to the OoO design (higher is better). Overall, the CG-OoO is 1.9x more efficient than the OoO.

Figure 6.11: Static and dynamic EPC normalized to the OoO processor (lower is better). Overall, static EPC is about 4% of the total EPC.

Figure 6.12: The branch prediction unit table access EPC normalized to the OoO processor. Overall, the CG-OoO BPU is 53% more efficient than that of the OoO.
Local register files (LRF) are statically allocated. As a result, each Block Window holds an exclusive LRF. The 20-entry LRF energy-per-access is about 25x smaller than that of a unified, 256-entry register file in the baseline OoO processor. The LRF has 2 read and 2 write ports, while the unified register file has 8 read and 4 write ports. In addition, since each BW holds an LRF near its instruction window and execution units, operand reads and writes take place over shorter average wire lengths. LRF's also enable additional energy saving by avoiding local write-operand wakeup broadcasts. Figure 6.13 shows the contribution of the local register file energy compared to the OoO baseline; it shows an average 26% reduction in register file energy consumption due to local register accesses.
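As a rough illustration of why a small, lightly ported table is cheap, consider a crude first-order model in which access energy scales with the number of entries times the number of ports. The function and unit constant below are placeholders; the dissertation's ~25x figure comes from detailed energy modeling, which this toy model only approximates:

```python
def rf_access_energy(entries, read_ports, write_ports, e_cell=1.0):
    """Crude first-order sketch: access energy grows with the number
    of entries (bitline length) and total ports (wordline/bitline
    capacitance). e_cell is an arbitrary unit, not a measured value."""
    ports = read_ports + write_ports
    return e_cell * entries * ports

lrf = rf_access_energy(entries=20, read_ports=2, write_ports=2)
grf = rf_access_energy(entries=256, read_ports=8, write_ports=4)
ratio = grf / lrf  # ~38x under this toy model; detailed modeling
                   # in the dissertation reports about 25x
```

Even this coarse sketch shows why reading a 20-entry, 4-port table is far cheaper than reading a 256-entry, 12-port one.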
Because local register operands are statically allocated, they do not require register renaming. As a result, a 23% average energy consumption reduction is observed in the register rename stage (see Figure 6.14).

Figure 6.13: The register file (RF) access EPC normalized to the OoO processor. Overall, the CG-OoO RF hierarchy is 94% more efficient than that of the OoO. S-GRF 9 shows the energy of a CG-OoO GRF with 9 segments, and RF shows the energy of an InO processor register file with the same number of ports as the OoO processor.
The global register file used in both OoO and CG-OoO has 256 entries. While
the use of local registers enables the use of a smaller global register file in CG-OoO
without noticeable reduction in performance, in my experiments, I use equal global
register file sizes for fair energy and performance modeling between the CG-OoO and
OoO processors.
As discussed in Chapter 3, to reduce the access energy overhead of a unified register file and to increase the aggregate number of ports in the CG-OoO, this processor model breaks the global register file (GRF) into multiple segments. Each segment is placed next to a BW. The access energy of each register file segment is roughly the unified OoO register file access energy divided by the number of segments. Figure 6.13 also shows the contribution of the global register file energy compared to the OoO baseline; it shows an average 68% reduction in the global register file energy consumption due to register file segmentation. Notice that GRF segmentation is not commonly used in OoO architectures; some ARM architectures bank the register file for various purposes such as better thread context switching support [10]. This means the segmented GRF technique is not exclusive to the CG-OoO, and may be pursued as an energy efficiency technique in OoO processors.

Figure 6.14: The register rename (RR) table access EPC normalized to the OoO processor. Overall, the CG-OoO RR is 23% more efficient than that of the OoO.
Figure 6.15 shows the effect of register file segmentation on energy. It shows the case of a unified GRF, one GRF segment per cluster (for a 3-cluster CG-OoO architecture), and one GRF segment per BW. As the number of register segments increases, energy consumption decreases linearly.

Placing a segment next to each BW saves energy when operations read and write their global operands to the closest GRF segment. The register-rename unit reduces data communication over wires by allocating to every global write operand an available physical register from the closest GRF segment.
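A minimal sketch of such a nearest-segment allocation policy follows. The function and data-structure names are hypothetical, and distance is modeled simply as the difference in segment index:

```python
def allocate_global_reg(free_lists, writer_segment):
    """Illustrative rename-stage policy: grant a free physical
    register from the GRF segment closest to the writing BW, so
    global write operands travel over short wires.

    free_lists[i] holds the free physical register ids of segment i.
    Falls back to the nearest non-empty segment when the writer's
    own segment has no free registers."""
    order = sorted(range(len(free_lists)),
                   key=lambda seg: abs(seg - writer_segment))
    for seg in order:
        if free_lists[seg]:
            return seg, free_lists[seg].pop()
    return None  # no free global register: rename stalls
```

The fallback keeps rename from stalling when the closest segment is exhausted, at the cost of a slightly longer wire for that operand.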
6.4.3 Instruction Scheduling
The CG-OoO processor introduces the Skipahead issue model. Figure 6.16 shows the energy breakdown for the CG-OoO dynamic scheduling hardware. In the OoO and CG-OoO, in-flight instructions are maintained in queues that are partly RAM tables and partly CAM tables. For the InO model, instructions are held in a small FIFO buffer. Figure 6.16 shows the majority of the OoO scheduling energy (75%) is in reading and writing instructions from the RAM table. Another 20% of the OoO energy is in CAM table accesses. The "Rest" of the energy is consumed in stage registers and the interconnects used for instruction wakeup and select. This figure also indicates a 90% average reduction in the CG-OoO RAM table energy (relative to the OoO RAM energy), which is due to accessing smaller SRAM tables, and a 95% average reduction in the CAM table energy, which is due to using 2 to 4-entry Head Buffers (HB) instead of the 128-entry CAM tables used in the OoO instruction queue. The "Rest" average energy is increased by 40% due to the larger number of pipeline registers at the issue stage.4 Overall, the CG-OoO issue stage is 84% more efficient than that of the OoO.

Figure 6.15: The segmented register file access EPC normalized to the OoO processor. As the number of GRF segments increases, the register file access energy decreases.
6.4.4 Block Re-Order Buffer
The CG-OoO processor maintains program order at block-level granularity. This makes read-write accesses to the BROB substantially smaller than accesses to the OoO ROB. Block write operations are done after decoding each block head, and block reads are done at the commit stage. Instructions access the BROB to notify the corresponding block entry of their completion. While the access mechanism is similar to that of

4Recall that the number of EU's for the CG-OoO is 12 (4 EU's per cluster). Since the contribution of "Rest" to the overall energy is quite small, a 40% increase in its energy is insignificant as seen in Figure 6.16.
Figure 6.16: The instruction issue EPC normalized to the OoO processor. Overall, the CG-OoO issue stage is 84% more efficient than that of the OoO. This energy is broken into three main categories to better highlight the sources of energy consumption.
the OoO processor, the frequency of accesses to the BROB is substantially smaller, making it much cheaper to maintain program order in the CG-OoO.

In addition, since the BROB is designed to maintain program order at block granularity, it is provisioned to have 16 entries rather than the 160 entries commonly used in modern OoO processors. The 10x reduction in the re-order buffer size makes all read-write operations 10x less energy consuming. Figure 6.17 shows a 76% average energy saving for the CG-OoO model.
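The block-granularity ordering logic described above can be sketched as follows. This is an illustrative model, not the simulator's implementation; the entry fields and the 16-entry default are assumptions drawn from the text:

```python
from collections import deque

class BlockROB:
    """Sketch of a block-level re-order buffer (BROB): one entry per
    in-flight code block instead of one per instruction."""

    def __init__(self, size=16):
        self.size = size
        self.entries = deque()  # program order; head = oldest block

    def dispatch(self, block_id, n_instrs):
        """Allocate an entry after the block head is decoded."""
        if len(self.entries) == self.size:
            return False  # structural hazard: front-end stalls
        self.entries.append({"id": block_id, "pending": n_instrs})
        return True

    def complete(self, block_id):
        """An instruction notifies its block entry of completion."""
        for entry in self.entries:
            if entry["id"] == block_id:
                entry["pending"] -= 1
                break

    def commit(self):
        """Whole blocks commit in order once fully executed."""
        committed = []
        while self.entries and self.entries[0]["pending"] == 0:
            committed.append(self.entries.popleft()["id"])
        return committed
```

Because allocation, commit, and squash touch one entry per block rather than one per instruction, the table stays small (16 entries here) and is accessed far less often than a conventional ROB.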
6.5 Cluster Analysis
As discussed in Chapter 4, the CG-OoO architecture focuses on reducing processor energy through a complexity-effective architecture design; to remain competitive with the OoO performance, this architecture supports a larger number of execution units (EU). To do so, the CG-OoO model must employ a design strategy that is more scalable than the OoO. Clustered execution was discussed in Chapter 4 as the technique used to scale the number of execution units. A cluster consists of a number of BW's sharing a number of EU's. To illustrate the effect of different clustering configurations, the experimental results in this section assume three clusters.
Figure 6.17: The instruction commit EPC normalized to the OoO processor. Overall, the CG-OoO commit stage is 76% more efficient than that of the OoO.

Figures 6.18a and 6.18b show the normalized average performance and energy of the SPEC Int 2006 benchmarks versus the number of BW's per cluster for various numbers of EU's per cluster. The speedup figure shows some clustering configurations reach beyond the performance of the OoO. All clustering models exhibit substantially lower energy consumption compared to the OoO design. The most energy efficient configuration is the one with 1 BW and 1 EU per cluster; it is 63% more energy efficient than the OoO, but reaches only 65% of the OoO performance. The highest-performance configuration evaluated here is the one with 6 BW's and 8 EU's per cluster; it is 39% more energy efficient than the OoO, and reaches 104% of the OoO performance. The design configuration studied throughout this chapter corresponds to the cross-over performance point with 3 BW's and 4 EU's per cluster.
Figure 6.19 shows the energy-performance characteristics of the CG-OoO model, plotting all the cluster configurations presented above. The lowest energy-performance point in the plot refers to the 1 BW, 1 EU per cluster configuration and the highest energy-performance point refers to the 6 BW, 8 EU per cluster configuration. This figure suggests that as the processor resource complexity increases, the energy-performance characteristics grow proportionally.
Figure 6.18: Normalized performance and energy for different clustering configurations. All configurations assume a 3-cluster CG-OoO model; the total number of BW's and EU's is calculated by multiplying the above numbers by 3. Here, performance is measured as the harmonic mean of the IPC and the energy is measured as the harmonic mean of the EPC over all the SPEC Int 2006 benchmarks.
Beyond a certain point in the scaling of this machine, the wakeup/select and load-store unit wire latencies become so large that the energy-performance proportionality of the CG-OoO will break. The study of identifying the energy-performance break point is outside the scope of this work.
6.6 Chapter Summary
This chapter discussed the energy and performance evaluations of the CG-OoO processor in comparison to the OoO and InO processor baselines. All evaluations are done on the SPEC Int 2006 benchmark suite. The performance results consider various processor widths and instruction scheduling models. The energy results are broken down into stage-energy results to highlight the savings opportunities at each individual pipeline stage. Overall, the CG-OoO processor is 50% more energy efficient than the OoO processor at the same level of performance. Furthermore, this chapter evaluated the energy versus performance scaling of this processor. Unlike the OoO processor, the CG-OoO is an execution model that delivers energy-performance proportionality for a wide range of execution resource configurations.
Figure 6.19: The energy versus performance plot showing different CG-OoO configurations normalized to the OoO processor. The CG-OoO core configurations illustrate the energy-performance proportionality attribute of the CG-OoO processor. Here, performance is measured as the harmonic mean of the IPC and the energy is measured as the harmonic mean of the EPC over all the SPEC Int 2006 benchmarks.
Chapter 7
Related Work
This chapter discusses the previous literature related to the subject of efficient single-threaded general purpose processing and explains the key differentiating factors between them and the CG-OoO architecture. Overall, the CG-OoO differs from the existing architectures in its focus on building a bottom-up design heavily focused on energy efficiency. The CG-OoO also devises unique architectural features to reach competitive computation performance; however, obtaining exceptional performance remains a second priority in this work. The CG-OoO leverages compiler solutions as well as complexity effective architectural solutions that, to the author's knowledge, bring the highest reported energy efficiency results at the performance of the OoO processor baseline.
7.1 CG-OoO Design Features
Several high-level design features distinguish the CG-OoO processor from the previous work. Table 7.1 summarizes these key differentiating elements and compares them to related architectures in the literature. The CG-OoO leverages a distributed micro-architecture capable of issuing instructions from multiple code blocks concurrently (column 1). The key enablers of energy efficiency in the CG-OoO are (a) its end-to-end complexity effective design (column 3), and (b) its effective use of compiler assistance in doing code clustering and generating efficient static code schedules (column 6). Despite the heavy reliance of the CG-OoO architecture on the compiler in providing energy efficiency opportunities, the CG-OoO requires no profiling support to deliver the energy and performance efficiency it offers (column 4). The CG-OoO is an energy-proportional design capable of scaling its hardware resources to larger or smaller computation units according to the workload demands of programs at runtime. Energy proportionality is provided via the clustered architecture design discussed in Chapter 4 (column 5). The CG-OoO supports an out-of-order issue model (column 7) at block granularity and a limited out-of-order issue model at instruction granularity (i.e., within a block). It also supports a hierarchical register file for energy efficiency purposes (column 8). Furthermore, unlike most previous studies, this work performs a detailed processor energy modeling analysis (column 2).
Braid [53] focuses on static partitioning of program basic-block instructions into braids using the program data-flow graph. Each braid runs as an independent block of code, similar to how the CG-OoO runs its code blocks. Clustering static instructions at sub-basic-block granularity requires additional control instructions to guarantee accurate execution of the program control flow at runtime. Adding these instructions causes instruction cache pressure and additional energy overhead for processing the instructions. In the CG-OoO, the compiler clusters all basic-block instructions as a whole rather than fragmenting them into smaller clusters. Furthermore, while Braid follows the cycle-by-cycle convention for accessing the branch prediction unit, the CG-OoO performs block level branch prediction. Similarly, the Braid architecture performs instruction-level commit and squash, whereas the CG-OoO architecture supports commit and squash at block granularity to save energy. Also, each braid execution unit can issue up to two instructions per cycle in-order (InO). The CG-OoO architecture introduces the Skipahead issue model to improve the CG-OoO instruction-level parallelism without causing any noticeable energy overhead. Unlike Braid, the CG-OoO architecture also incorporates segmented global registers to further improve the processor energy efficiency.
WiDGET [55] is a power-proportional grid execution design consisting of a decoupled thread context management unit and a large set of simple execution units. It has a dynamic instruction steering protocol to effectively perform instruction scheduling at runtime. WiDGET is an extension of the work by Salverda and Zilles [44] on designing an instruction scheduling cost model. The CG-OoO is a bulk code scheduling solution that follows a similar clustering design to deliver energy proportionality while aggressively focusing on improving energy efficiency. WiDGET performs dynamic data dependency detection to steer instructions to a particular execution flow while the CG-OoO clusters instructions at compile time. WiDGET processes instructions at block level granularity and performs out-of-order instruction issue. The CG-OoO model replaces the out-of-order instruction issue with the Skipahead issue model, which is substantially more energy efficient. Lastly, unlike the CG-OoO, WiDGET does not support coarse-grain squash and commit.

Table 7.1: Eight high level design features of the CG-OoO architecture compared against the related work in the literature. The eight features (columns) are: a distributed micro-architecture with coarse-grain execution, a processor energy model, a complexity-effective design, no reliance on profiling, pipeline clustering, combined static and dynamic scheduling, out-of-order issue, and a register file hierarchy. The compared designs (rows) are the CG-OoO, Braid [53], WiDGET [55], TRIPS / EDGE [8, 46], Multiscalar [49], Complexity Effective [39], Trace Processors [43], MorphCore [28], BOLT [21], iCFP [20], ILDP [29], and WaveScalar [29].
Multiscalar [49] evaluates a multi-processing unit capable of steering coarse grain
code segments that are often larger than a basic-block to its processing units for com-
putation. Multiscalar replicates register context for each computation unit, increasing
the data communication across its register files. In contrast, the CG-OoO partitions
the global register file so that each computation unit holds a segment of the register
file to reduce read/write access energy. Multiscalar can be configured as an OoO or
InO processor. The CG-OoO supports limited OoO execution.
The Complexity Effective paper [39] proposes a distributed instruction window that simplifies the wake-up logic, issue window, and the forwarding logic design. In this paper, instruction scheduling and steering is done at instruction granularity, whereas in the CG-OoO an entire code block is assigned to a block window. The same holds for how instructions are squashed and committed.
ILDP [29] proposes an architecture for distributed processing that consists of a
hierarchical register file built for communicating short-lived registers locally and long-
lived registers globally. ILDP relies on profiling and in-order instruction scheduling
from each processing unit. The CG-OoO has a similar register file hierarchy, uses no
profiling, performs limited OoO instruction scheduling, and utilizes the coarse-grain
execution model.
TRIPS / EDGE [9] [46] is a high-performance, grid-processing architecture that
uses static instruction scheduling in space and dynamic scheduling in time. It uses
Hyperblocks [33] to map instructions to the grid of computational units. Its primary
focus is extracting as much instruction-level parallelism (ILP), thread-level parallelism (TLP), and data-level parallelism (DLP) from the program as possible. TRIPS delivers high computation performance despite the multi-cycle delay for transmitting signals over on-chip wires. Hyperblocks use branch predication to group basic-blocks that are connected together through weakly biased branches. While effective for improving instruction parallelism, Hyperblocks lead to energy inefficient mis-speculation recovery events. The CG-OoO architecture benefits from an energy efficient static and dynamic instruction scheduling hybrid. To construct Hyperblocks, the TRIPS compiler needs program profiling information. Profiling is not required in the CG-OoO.
iCFP [20] addresses the head-of-queue blocking problem in the InO processor by
building an architecture that, on every cache miss, checkpoints the program context
and steers miss-dependent instructions to a side buffer, enabling miss-independent
instructions to make forward progress. CFP [51] addresses the same problem in the
OoO processor. It enables the use of a small register file and instruction window while
maintaining the same level of performance as conventional OoO processors. Similarly,
BOLT [21], Flea Flicker [3], and Runahead Execution [38] are high-ILP, high-MLP 1,
latency-tolerant (LT) architecture designs for energy-efficient out-of-order execution.
All of these architectures follow the runahead execution model. BOLT uses a
slice buffer design that utilizes minimal hardware resources. Unlike iCFP, CFP, and
BOLT, which perform lightweight dynamic instruction scheduling to manage available
instructions and hide LLC cache misses, the CG-OoO utilizes multiple issue queues
rather than a side buffer to tolerate LLC miss latency. Furthermore, unlike
the CG-OoO, none of these publications target an energy-proportional architecture.
Trace Processors [43] proposes a register file hierarchy and instruction flow design
based on dynamic trace processing in the front-end. Similar to the CG-OoO, the
register file hierarchy in Trace Processors consists of several local register files and
a global register file. This architecture uses dynamic code traces to cluster instructions
for dispatch to different processing elements. The CG-OoO uses control-flow graph
(CFG) basic-blocks to cluster instructions, and performs a combination of static and
dynamic instruction scheduling to save energy.

1 MLP: Memory-level parallelism.
MorphCore [28] is an in-order / out-of-order hybrid architecture designed to improve
single-threaded energy efficiency. It utilizes either mode depending on the program
state and resource requirements. Unlike the CG-OoO, it uses dynamic instruction
scheduling to execute and commit instructions. MorphCore reports a 22% improvement
in its energy-delay product, while the CG-OoO achieves over 95% energy-delay
product improvement when compared against a similar OoO processor baseline.
WaveScalar [52] is an out-of-order data-flow computing architecture that utilizes
the WaveCache memory model. The WaveCache combines clusters of execution units
with small data caches and store buffers to form a computing substrate. WaveScalar
focuses on solving the problem of long wire delays by bringing computation close to
data. The CG-OoO is focused on designing an energy-efficient core architecture.
7.2 CG-OoO Energy Efficiency Features
My studies on the topic of energy-efficient, single-threaded computing highlighted the
need for constructing a distributed micro-architecture that turns the deep instruction
queue into a multitude of small parallel instruction queues. Such an architecture
allows building high-performance and energy-efficient processors. Most such designs,
however, have traditionally focused on achieving superior computation performance
rather than superior energy efficiency. To the author's knowledge, WiDGET [55] is
the only work focused on evaluating the energy trade-off in this class of design. The
CG-OoO advances this class of design by performing an end-to-end redesign of the
architecture based on the coarse-grain execution concept to achieve superior energy
efficiency.

The superior energy efficiency and competitive performance of the CG-OoO is due to
the combination of several design techniques; Table 7.2 shows the different techniques
used in this work and compares them against the design techniques in previous designs.
These techniques, combined, outperform the energy efficiency gains reported in the
previous literature while maintaining performance competitive with that of the OoO.
While many of the design features in this work have been used in other literature,
mostly to improve processor performance, the CG-OoO showcases their strong energy
efficiency capability when combined in the context of coarse-grain execution.
7.2.1 Degree of Coarse Granularity

Different works have focused on different degrees of code clustering. TRIPS [46]
utilizes Hyperblocks [33] to cluster instructions, PowerPC 604 [50] utilizes Superblocks [24],
Multiscalar [49] utilizes large CFG fragments, and Braid [53] and WiDGET [55]
utilize sub-basic-block instruction clusters. The CG-OoO focuses on coarse-grain
execution at basic-block granularity (column 8). In earlier chapters, I discussed why
basic-blocks are the ideal choice of instruction granularity for energy-efficient computing.
7.2.2 Front-end Energy Efficiency

The front-end energy efficiency features consist of (a) coarse branch prediction (column 1),
which accesses the branch prediction tables only upon control operations rather
than at every cycle, and (b) the register renaming bypass, which limits register
renaming table accesses to global register operands only (column 2).

Similar to the CG-OoO front-end, BLISS, FTB, and BSA perform branch prediction
at basic-block granularity. TRIPS / EDGE [9] [46] and Multiscalar [49] perform
branch prediction at coarser granularity.
TRIPS / EDGE [9] [46] performs coarse branch prediction at the Hyperblock
granularity. A Hyperblock may contain up to 128 operations. Instructions within a
Hyperblock are fetched, executed, and committed as a whole. Since values produced
and consumed within a block are not stored in register banks, TRIPS bypasses
register renaming and register access for a large fraction of operands. The CG-OoO
architecture finds Hyperblocks an inadequate choice for energy-efficient computation
due to the execution of predicated operations and the very large size of code blocks.
The CG-OoO manages local registers statically and maintains them in small LRFs.

Hao et al. [19] and Melvin et al. [35] propose Block-Structured ISAs (BSA) to
improve front-end efficiency and execution performance compared to OoO processors
that use conventional ISAs. These architectures maintain block descriptors along with
DESIGN (rows): CG-OoO; Braid [53]; WiDGET [55]; TRIPS / EDGE [46]; Multiscalar [49];
Complexity Effective [39]; Trace Processors [43]; BLISS [59], FTB [42];
BSA [19] [35] [56] [26] [47]; BMRF [54].

FEATURES (columns): (1) Block Branch Prediction; (2) Register Renaming Bypass;
(3) Local Register File; (4) Segmented Global Register File; (5) Block-ROB (Coarse
Commit); (6) Skipahead; (7) Coarse-Grain Squash Model; (8) CG @ Basic-block.

Table 7.2: Eight key micro-architectural features contributing to the energy efficiency
of the CG-OoO architecture compared against the related work in the literature.
CG in the last column refers to the level of coarse granularity with respect to the
granularity of a basic-block.
the basic-block code. BLISS [59] and FTB [42] are block-level, energy-aware front-end
architecture designs that maintain block descriptors separate from the basic-block
code and replace the Translation Lookaside Buffer (TLB) with a BB-Cache that stores
the block descriptors in programs. These architectures improve instruction cache
energy, improve branch prediction by reducing over- and under-speculation, increase
the ratio of retired to fetched operations, and avoid unnecessary instruction fetches.

The CG-OoO design maintains block descriptors along with the basic-block code
and utilizes the TLB to fetch instructions. The choice of an efficient front-end is
orthogonal to the focus of this work. The CG-OoO architecture delivers an execution
model platform capable of providing significant energy efficiency compared to
conventional OoO architectures. The flexible design of the CG-OoO, however, offers
convenient integration opportunities for energy-aware, block-level solutions such as
BLISS.
7.2.3 Back-end Energy Efficiency

The back-end energy efficiency features consist of (a) the register file hierarchy design
(columns 3 and 4), which combines static and dynamic register allocation, (b) the
unique Skipahead instruction issue design, which leverages a distributed issue queue
model and enables limited out-of-order issue (column 6), and (c) a block-level re-order
buffer to maintain program order and enable program squash at a coarse granularity
while maintaining support for precise exceptions (columns 5 and 7). Column 5 shows
that TRIPS maintains program order at block granularity (i.e. Hyperblock). Column 7
shows that Multiscalar and Trace Processors support coarse-grain squash; however,
due to the large size of code blocks in both architectures, control mis-speculations
expose the pipeline depth more significantly than in the CG-OoO.
7.3 OoO Energy Efficiency Arguments

Czechowski et al. [11] discuss the energy efficiency techniques used in the recent
generation of Intel CPU architectures (e.g. Core i7 4770K). They focus on effective
use of the Micro-op Cache and Loop Cache to bypass excessive fetch and decode
activity on the processor front-end. They also describe the use of the Single Instruction
Multiple Data (SIMD) ISA and other circuit-level innovations as means to improve
processor energy and performance. The CG-OoO processor questions the inherent
energy efficiency attributes of the OoO execution model and provides a solution that
is over 50% more energy efficient than the baseline OoO. The energy efficiency
techniques discussed in [11] can also be applied to the CG-OoO model to make it
even more energy efficient.
7.4 Energy Modeling

McPAT [30] and Wattch [7] are widely known energy modeling tools in the computer
architecture research community; McPAT extends and improves the modeling
capabilities of Wattch. I chose to build an independent energy model because of (a) the
simulation inaccuracies of McPAT [57], and (b) the difficulty of correctly modifying
the McPAT software to model the CG-OoO processor. Instead, I built an energy
model using industrial-grade circuit simulation tools, Design Compiler and SPICE.
My energy model does not yet have an extensive evaluation of all processor components
(e.g. logic blocks). However, as discussed earlier, the missing hardware components
have a secondary effect on the overall core energy and would be designed similarly
for both the CG-OoO and OoO processors.
7.5 Simulation Framework

Gem5 [4], Marssx86 [40], and ZSim [45] are among the popular simulation frameworks
used for modeling OoO and InO processors. Gem5 has limited support for OoO
execution on the x86 ISA. Marssx86 is an extension of PTLSim [58] designed for
x86 simulation. Neither of these simulators offers the flexibility to modify the
input ISA; recall that the CG-OoO makes changes to the x86 ISA. ZSim is a Pintool-based
simulator that targets fast multi-core simulation with limited support for detailed
single-threaded simulation. All of these simulators require extensive modification to
produce cycle-by-cycle energy evaluations; as a result, they are often coupled with
McPAT for energy evaluation. As discussed in Chapter 5, my simulator is based
on Pintool and performs detailed OoO, InO, and CG-OoO single-threaded timing
simulation. It also has built-in support for cycle-by-cycle energy modeling.
7.6 Chapter Summary

In this chapter, I compared the architectural characteristics of the CG-OoO processor
against the previous literature and identified the main differentiating parameters of
the CG-OoO. Braid [53] and WiDGET [55] are the two architectures closest to the
CG-OoO design; however, neither work studies all the essential design parameters for
building a highly energy-efficient single-threaded processor. To the author's knowledge,
the CG-OoO design is the only architecture with over 50% reduction in energy
consumption at the same performance level as the baseline OoO processor.
Chapter 8
Conclusion
8.1 Summary of Thesis Contributions
In this thesis, I presented the Coarse-Grain Out-of-Order (CG-OoO) processor
architecture. This processor is unique in its ability to save over 50% of the out-of-order
processor energy while maintaining the same level of performance. The CG-OoO
consists of several energy efficiency solutions, including compiler optimization
techniques to improve the code energy footprint and hardware techniques designed
to bring energy efficiency to the processor pipeline. The energy-saving software
techniques consist of:

• Software Managed Register Allocation

• Block-level Instruction List Scheduling

Software Managed Register Allocation: the CG-OoO register file hierarchy
includes a Local Register File model that tracks registers whose lifetime is limited
to their own basic-block. These registers are allocated and managed by the compiler.
This saves energy by allowing the processor to bypass register renaming for local
register operands, and by reducing the pressure on the physical register file (a.k.a.
the Global Register File).
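To make the local/global distinction concrete, here is a minimal Python sketch of the lifetime analysis such a compiler could use; it is an illustration, not the dissertation's actual compiler pass, and all register and block names are hypothetical. A register whose definition and all uses fall inside one basic-block can live in the LRF and skip renaming; anything live across blocks goes to the GRF.

```python
def classify_registers(blocks):
    """blocks: list of basic blocks; each block is a list of
    (dest_regs, src_regs) tuples in program order."""
    defined_in = {}   # register -> block index of its (first) definition
    used_in = {}      # register -> set of block indices that read it
    for b, insts in enumerate(blocks):
        for dests, srcs in insts:
            for r in srcs:
                used_in.setdefault(r, set()).add(b)
            for r in dests:
                defined_in.setdefault(r, b)
    local, glob = set(), set()
    for r, b in defined_in.items():
        if used_in.get(r, set()) <= {b}:
            local.add(r)   # lifetime confined to one block -> LRF, no renaming
        else:
            glob.add(r)    # crosses a block boundary -> GRF, renamed
    return local, glob

blocks = [
    [(["t0"], ["a"]), (["t1"], ["t0"]), (["x"], ["t1"])],  # block 0
    [(["y"], ["x"])],                                      # block 1 reads x
]
local, glob = classify_registers(blocks)   # t0 and t1 never escape block 0
```

In this toy example only `x` crosses a block boundary, so only `x` would consume a physical (global) register and a rename-table entry.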
Block-level Instruction Scheduling: each code block is list-scheduled by the
compiler to generate an efficient code schedule to run on the CG-OoO processor.

The energy-saving hardware techniques focus on reducing excessive accesses to
hardware units, or on reducing large tables into either smaller tables or partitioned
tables. The partitioned tables are distributed among the on-chip Block Window
clusters. The energy-saving hardware techniques consist of:
• Block-Level Branch Prediction & Fetch
• Register File Hierarchy
• Register Renaming Bypass
• Dynamic Instruction Scheduling
• Block-Level Re-Order Buffer
Block-Level Branch Prediction & Fetch: the branch prediction unit, in this
study, decouples control speculation from Fetch and enables a significant reduction in
branch predictor lookup events without additional fetch-stall cycles. As a result, the
branch predictor energy overhead is reduced and its prediction accuracy is improved.
The improved prediction accuracy is the result of fewer accesses to the BPU, which
in turn reduce mis-prediction events due to aliasing.
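As a rough illustration of why this reduces lookup energy, the following hedged Python sketch counts predictor probes; the function name, interface, and block sizes are mine, chosen only to show the per-cycle versus per-block difference.

```python
def bpu_lookups(block_sizes, per_instruction=False):
    """Predictor probes needed to fetch the given sequence of basic blocks."""
    if per_instruction:
        return sum(block_sizes)   # conventional: probe on every fetch cycle
    return len(block_sizes)       # block-level: probe once per control op

sizes = [6, 4, 9, 5]              # instructions in each fetched block
conventional = bpu_lookups(sizes, per_instruction=True)
coarse = bpu_lookups(sizes)       # one probe per block instead of per cycle
```

For these illustrative block sizes the block-level scheme performs 4 probes where a per-cycle scheme performs 24; fewer probes also mean fewer aliasing opportunities in the prediction tables.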
Register File Hierarchy: the two register file structures used in this architecture
are known as the Local Register File (LRF) and the Global Register File (GRF).
Each Block Window holds a small, compiler-managed LRF; the small size of the
LRF makes it an energy-efficient table to access. The GRF is managed by the
register rename unit and is physically partitioned to reduce its access energy and
increase its effective port count.
Register Renaming Bypass: only the global register operands require register
renaming. Bypassing the register rename stage for local operands reduces the energy
footprint of this stage.
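A minimal sketch of the bypass idea follows, with an assumed interface rather than the actual pipeline structures: only operands flagged as global touch the rename table, so local operands incur no table-access energy. The counter is a stand-in proxy for that energy.

```python
class RenameBypass:
    """Rename table touched only by global operands; locals bypass it."""
    def __init__(self):
        self.table = {}           # architectural global reg -> physical reg
        self.next_phys = 0
        self.table_accesses = 0   # proxy for rename-stage energy

    def rename_src(self, reg, is_global):
        if not is_global:
            return reg            # LRF operand: slot fixed by the compiler
        self.table_accesses += 1
        return self.table.get(reg, reg)

    def rename_dst(self, reg, is_global):
        if not is_global:
            return reg            # local destination: no physical register
        self.table_accesses += 1
        phys = "p%d" % self.next_phys
        self.next_phys += 1
        self.table[reg] = phys
        return phys

rn = RenameBypass()
rn.rename_dst("l0", is_global=False)  # local destination: table untouched
rn.rename_dst("g1", is_global=True)   # global destination: gets a physical reg
rn.rename_src("g1", is_global=True)   # global source: reads the mapping
```

Three operands are processed but only the two global ones access the table; in code with mostly block-local values, most operands take the bypass path.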
Dynamic Instruction Scheduling: the CG-OoO supports limited out-of-order
issue via two major techniques: (a) across code blocks, the presence of multiple
in-flight code blocks allows concurrent issue of instructions from different code blocks,
enabling block-level parallelism, and (b) within each code block, the Skipahead
scheduling model allows limited, complexity-effective out-of-order instruction issue.
The combination of these two techniques (along with the compiler's block-level list
scheduling) provides significant energy savings in this stage.
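The within-block half of this scheme can be sketched as follows; this is a simplified model of the Skipahead idea with an illustrative lookahead depth of 2, not the design's actual parameters. If the queue head is not ready, only a small fixed number of younger entries are examined, keeping the select logic far cheaper than a full CAM-based issue window.

```python
def skipahead_issue(queue, ready, depth=2):
    """Pick the next instruction to issue from one block window's queue.

    queue: instruction ids in program order; ready: set of ids whose
    operands are available; depth: how far past a blocked head the
    scheduler may look (illustrative value, not the real design point).
    """
    for i, inst in enumerate(queue[: depth + 1]):  # head + limited lookahead
        if inst in ready:
            return i
    return None                                    # stall this block window

q = ["i0", "i1", "i2", "i3"]
assert skipahead_issue(q, ready={"i1"}) == 1    # head blocked: skip ahead
assert skipahead_issue(q, ready={"i3"}) is None # outside the lookahead window
```

Because each block window runs this cheap check independently, block-level parallelism across windows supplies most of the out-of-order benefit while each window's logic stays near in-order complexity.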
Block-Level Re-Order Buffer: unlike the out-of-order model, the CG-OoO model
maintains program order at block granularity. This reduces the ROB size by 10×,
reducing its energy overhead by the same amount. Squash events are also handled at
block granularity. Special provisions are made to ensure precise exception handling
is supported upon interrupt or exception events.
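The following hedged Python sketch models a block-granularity ROB; the data structure and method names are my own, not the hardware's. It keeps one entry per in-flight block, commits fully completed blocks in order, and squashes younger blocks wholesale.

```python
from collections import deque

class BlockROB:
    """One ROB entry per in-flight code block: [block_id, uncompleted_insts]."""
    def __init__(self):
        self.rob = deque()

    def dispatch(self, block_id, n_insts):
        self.rob.append([block_id, n_insts])      # blocks enter in program order

    def complete(self, block_id):
        for entry in self.rob:
            if entry[0] == block_id:
                entry[1] -= 1                     # one instruction finished

    def commit(self):
        committed = []
        while self.rob and self.rob[0][1] == 0:   # oldest block fully done
            committed.append(self.rob.popleft()[0])
        return committed

    def squash_younger_than(self, block_id):
        while self.rob and self.rob[-1][0] != block_id:
            self.rob.pop()                        # drop whole younger blocks

rob = BlockROB()
rob.dispatch("B0", 2)
rob.dispatch("B1", 1)
rob.complete("B0")
rob.complete("B0")
committed = rob.commit()                          # B0 retires; B1 still in flight
```

With, say, 16 instructions per block, a per-block ROB needs an order of magnitude fewer entries than a per-instruction ROB for the same in-flight window, which is the source of the 10× size reduction cited above.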
8.2 Future Research Directions

There are many different applications and extensions that can be applied to the
CG-OoO processor architecture and compiler.

Extension to Code Optimization: part of this research has focused on building
a compiler infrastructure to improve energy efficiency. While static compilation
helped improve the CG-OoO energy efficiency and performance, there is potential
for performing profile-driven optimizations using, for instance, a dynamic binary
translation framework like the Nvidia Denver project [6, 13]. Extending the CG-OoO
processor to utilize profiling may enable the use of code blocks that combine multiple
basic-blocks into execution traces that deliver more instruction-level parallelism (ILP)
and consume less energy.
Extension to Multi-threaded Applications: the CG-OoO processor presents an
execution model that delivers competitive performance with superior energy efficiency.
This, however, is only the first step toward extending this execution model to the
multi-threaded and multi-core space, where further research can study the compiler
and architecture requirements of modern parallel applications running in the CG-OoO
environment.
Extension to Mobile and Server Applications: this research focuses on evaluating
the energy and performance impact of the CG-OoO processor on the SPEC
Int 2006 benchmark suite, which consists of mostly scientific computing applications
and is known as one of the most challenging single-threaded benchmark suites.
Extending the evaluation of the CG-OoO to server benchmark applications would
highlight additional architectural design features a CG-OoO server processor may
require. Likewise, studying mobile benchmarks would identify potential design
extensions necessary for the CG-OoO to be adopted in the mobile processor space.
Extension to Hardware Scaling: the clustered execution model of the CG-OoO
processor enables two key benefits: (a) energy-performance proportionality, and
(b) scalable execution. The scalable design allows utilizing just as many Block
Window clusters as necessary to extract ILP. In the case of critical-path-bound code,
for instance, the processor may not benefit from fetching many code blocks when
none of them can contribute to the ILP. A Dynamic Resource Management Unit
that adjusts scheduling and execution resources to the runtime application ILP may
further improve the processor energy footprint without sacrificing performance.
Bibliography
[1] White paper: NVIDIA Charts Its Own Path to ARMv8. Technical report, Tirias
Research, August 2014.
[2] Omid Azizi, Aqeel Mahesri, Benjamin C Lee, Sanjay J Patel, and Mark Horowitz.
Energy-performance tradeoffs in processor architecture and circuit design: a
marginal cost analysis. In ACM SIGARCH Computer Architecture News, vol-
ume 38, pages 26–36. ACM, 2010.
[3] Ronald D Barnes, Shane Ryoo, and Wen-mei W Hwu. Flea-flicker multipass
pipelining: An alternative to the high-power out-of-order offense. In Proceedings
of the 38th annual IEEE/ACM International Symposium on Microarchitecture,
pages 319–330. IEEE Computer Society, 2005.
[4] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali
Saidi, Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh
Sardashti, et al. The gem5 simulator. ACM SIGARCH, 39(2):1–7, 2011.
[5] Sarah Bird, Aashish Phansalkar, Lizy K John, Alex Mericas, and Rajeev In-
dukuru. Performance characterization of SPEC CPU benchmarks on Intel's Core
microarchitecture based processor. In SPEC Benchmark Workshop, 2007.
[6] Darrell Boggs, Gary Brown, Nathan Tuck, and K Venkatraman. Denver: Nvidia's
first 64-bit ARM processor. 2015.
[7] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: a framework for
architectural-level power analysis and optimizations, volume 28. ACM, 2000.
[8] Doug Burger and Stephen W Keckler. 19.5: Breaking the gop/watt barrier with
edge architectures. In GOMACTech Intelligent Technologies Conference, 2005.
[9] Doug Burger, Stephen W Keckler, Kathryn S McKinley, Mike Dahlin, Lizy K John,
Calvin Lin, Charles R Moore, James Burrill, Robert G McDonald, and William
Yoder. Scaling to the end of silicon with EDGE architectures. Computer, 37(7):44–
55, 2004.
[10] Keil Tool by ARM. ARM registers. http://www.keil.com/support/man/docs/
armasm/armasm_dom1359731128950.htm. [Online; accessed 24-August-2015].
[11] Kenneth Czechowski, Victor W Lee, Ed Grochowski, Ronny Ronen, Ronak Sing-
hal, Richard Vuduc, and Pradeep Dubey. Improving the energy e�ciency of big
cores. In Proceeding of the 41st annual international symposium on Computer
architecuture, pages 493–504. IEEE Press, 2014.
[12] Bill Dally. Project Denver: Processor to usher in new era of computing. [Online].
Available from http://blogs.NVIDIA.com/2011/01/project-denver-processor-tousher-in-new-era-of-computing/ [Accessed 2nd August 2012], 2011.

[13] Bill Dally. Project Denver: Processor to usher in new era of computing. [Online].
Available from http://blogs.NVIDIA.com/2011/01/project-denver-processor-tousher-in-new-era-of-computing/ [Accessed 2nd August 2012], 2011.
[14] Subhasis Das, Tor M Aamodt, and William J Dally. Slip: reducing wire en-
ergy in the memory hierarchy. In Proceedings of the 42nd Annual International
Symposium on Computer Architecture, pages 349–361. ACM, 2015.
[15] Michael Ditty, John Montrym, and Craig Wittenbrink. Nvidia's Tegra K1 system-on-chip.
Hot Chips: A Symposium on High Performance Chips, 2014.
[16] Daniele Folegnani and Antonio Gonzalez. Energy-effective issue logic. In ACM
SIGARCH Computer Architecture News, volume 29, pages 230–239. ACM, 2001.
[17] Linley Gwennap. MIPS R12000 to hit 300 MHz. Microprocessor Report, 11(13):1,
1997.
[18] Greg Hamerly, Erez Perelman, Jeremy Lau, and Brad Calder. SimPoint 3.0:
Faster and more flexible program phase analysis. Journal of Instruction Level
Parallelism, 7(4):1–28, 2005.
[19] Eric Hao, Po-Yung Chang, Marius Evers, and Yale N Patt. Increasing the instruc-
tion fetch rate via block-structured instruction set architectures. International
Journal of Parallel Programming, 26(4):449–478, 1998.
[20] Andrew Hilton, Santosh Nagarakatte, and Amir Roth. iCFP: Tolerating all-level
cache misses in in-order processors. In High Performance Computer Architec-
ture, 2009. HPCA 2009. IEEE 15th International Symposium on, pages 431–442.
IEEE, 2009.
[21] Andrew Hilton and Amir Roth. BOLT: energy-efficient out-of-order latency-tolerant
execution. In High Performance Computer Architecture (HPCA), 2010
IEEE 16th International Symposium on, pages 1–12. IEEE, 2010.
[22] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, et al. The microarchitecture
of the Pentium 4 processor. In Intel Technology Journal. Citeseer, 2001.
[23] Wei Huang, Shougata Ghosh, Sivakumar Velusamy, Karthik Sankaranarayanan,
Kevin Skadron, and Mircea R Stan. HotSpot: A compact thermal modeling
methodology for early-stage VLSI design. Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on, 14(5):501–513, 2006.
[24] Wen-Mei W Hwu, Scott A Mahlke, William Y Chen, Pohua P Chang, Nancy J
Warter, Roger A Bringmann, Roland G Ouellette, Richard E Hank, Tokuzo
Kiyohara, Grant E Haab, et al. The superblock: an effective technique for VLIW
and superscalar compilation. The Journal of Supercomputing, 7(1-2):229–248,
1993.
[25] ITRS. ITRS 2012 Executive Summary. [Online]. Available: http://www.itrs.net/Links/2012ITRS/Home2012.htm.
[26] Vinod Kathail, Michael Schlansker, and Bantwal Ramakrishna Rau. HPL Play-
Doh architecture specification: Version 1.0. Hewlett-Packard Laboratories, 1994.
[27] Richard E Kessler, Edward J McLellan, and David A Webb. The Alpha 21264
microprocessor architecture. In Computer Design: VLSI in Computers and Pro-
cessors, 1998. ICCD’98. Proceedings. International Conference on, pages 90–95.
IEEE, 1998.
[28] K Khubaib, M Aater Suleman, Milad Hashemi, Chris Wilkerson, and Yale N
Patt. MorphCore: An energy-efficient microarchitecture for high performance
ilp and high throughput tlp. In Microarchitecture (MICRO), 2012 45th Annual
IEEE/ACM International Symposium on, pages 305–316. IEEE, 2012.
[29] H-S Kim and James E Smith. An instruction set and microarchitecture for
instruction level distributed processing. In Computer Architecture, 2002. Pro-
ceedings. 29th Annual International Symposium on, pages 71–81. IEEE, 2002.
[30] Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen,
and Norman P Jouppi. McPAT: an integrated power, area, and timing modeling
framework for multicore and manycore architectures. In Microarchitecture, 2009.
MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pages 469–
480. IEEE, 2009.
[31] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff
Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building
customized program analysis tools with dynamic instrumentation. ACM Sigplan
Notices, 40(6):190–200, 2005.
[32] Stephen W Mahin, Stephen M Conor, Stephen J Ciavaglia, Lyman H Moul-
ton III, Stephen E Rich, and Paul D Kartschoke. Superscalar instruction pipeline
using alignment logic responsive to boundary identification logic for aligning and
appending variable length instructions to instructions stored in cache, April 29
1997. US Patent 5,625,787.
[33] Scott A Mahlke, David C Lin, William Y Chen, Richard E Hank, and Roger A
Bringmann. Effective compiler support for predicated execution using the hyperblock.
In ACM SIGMICRO Newsletter, volume 23, pages 45–54. IEEE Computer
Society Press, 1992.
[34] Daniel S McFarlin, Charles Tucker, and Craig Zilles. Discerning the dominant
out-of-order performance advantage: is it speculation or dynamism? In ACM
SIGPLAN Notices, volume 48, pages 241–252. ACM, 2013.
[35] Stephen Melvin and Yale Patt. Enhancing instruction scheduling with a block-structured
ISA. International Journal of Parallel Programming, 23(3):221–243,
1995.
[36] Milad Mohammadi, Shuo Han, Tor Aamodt, and Barry Daly. On-demand dy-
namic branch prediction. 2013.
[37] Milad Mohammadi, Song Han, T Aamodt, and WJ Dally. On-demand dynamic
branch prediction.
[38] Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N Patt. Runahead execu-
tion: An alternative to very large instruction windows for out-of-order processors.
In High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.
The Ninth International Symposium on, pages 129–140. IEEE, 2003.
[39] Subbarao Palacharla, Norman P Jouppi, and James E Smith. Complexity-effective
superscalar processors, volume 25. ACM, 1997.
[40] Avadh Patel, Furat Afram, Shunfei Chen, and Kanad Ghose. MARSS: a full system
simulator for multicore x86 CPUs. In Proceedings of the 48th Design Automation
Conference, pages 1050–1055. ACM, 2011.
[41] Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew Sun, and
Anand Karunanidhi. Pinpointing representative portions of large Intel Itanium
programs with dynamic instrumentation. In Proceedings of the 37th
annual IEEE/ACM International Symposium on Microarchitecture, pages 81–92.
IEEE Computer Society, 2004.
[42] Glenn Reinman, Todd Austin, and Brad Calder. A scalable front-end architec-
ture for fast instruction delivery. In ACM SIGARCH Computer Architecture
News, volume 27, pages 234–245. IEEE Computer Society, 1999.
[43] Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, and Jim Smith. Trace
processors. In Microarchitecture, 1997. Proceedings., Thirtieth Annual IEEE/ACM
International Symposium on, pages 138–148. IEEE, 1997.
[44] Pierre Salverda and Craig Zilles. Fundamental performance constraints in hor-
izontal fusion of in-order cores. In High Performance Computer Architecture,
2008. HPCA 2008. IEEE 14th International Symposium on, pages 252–263.
IEEE, 2008.
[45] Daniel Sanchez and Christos Kozyrakis. ZSim: fast and accurate microarchitectural
simulation of thousand-core systems. In ACM SIGARCH Computer
Architecture News, volume 41, pages 475–486. ACM, 2013.
[46] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu
Kim, Jaehyuk Huh, Doug Burger, Stephen W Keckler, and Charles R Moore.
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Com-
puter Architecture, 2003. Proceedings. 30th Annual International Symposium on,
pages 422–433. IEEE, 2003.
[47] Michael S Schlansker and B Ramakrishna Rau. EPIC: An architecture for
instruction-level parallel processors. Hewlett-Packard Laboratories, 2000.
[48] Andre Seznec, Stephen Felix, Venkata Krishnan, and Yiannakis Sazeides. Design
Tradeoffs for the Alpha EV8 Conditional Branch Predictor. In Proc. IEEE/ACM
Symp. on Computer Architecture (ISCA), pages 295–306, 2002.
[49] Gurindar S Sohi, Scott E Breach, and TN Vijaykumar. Multiscalar processors.
In ACM SIGARCH Computer Architecture News, volume 23, pages 414–425.
ACM, 1995.
[50] S Peter Song, Marvin Denman, and Joe Chang. The PowerPC 604 RISC
microprocessor. IEEE Micro, (5):8–17, 1994.
[51] Srikanth T Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, and Mike
Upton. Continual flow pipelines. ACM SIGPLAN Notices, 39(11):107–119, 2004.
[52] Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin.
WaveScalar. In Proceedings of the 36th annual IEEE/ACM International Sym-
posium on Microarchitecture, page 291. IEEE Computer Society, 2003.
[53] Francis Tseng and Yale N Patt. Achieving out-of-order performance with almost
in-order complexity. In Computer Architecture, 2008. ISCA’08. 35th Interna-
tional Symposium on, pages 3–12. IEEE, 2008.
[54] Jessica H Tseng and Krste Asanovic. Banked multiported register files for high-
frequency superscalar microprocessors. ACM SIGARCH Computer Architecture
News, 31(2):62–71, 2003.
[55] Yasuko Watanabe, John D Davis, and David A Wood. WiDGET: Wisconsin
decoupled grid execution tiles. In ACM SIGARCH Computer Architecture News,
volume 38, pages 2–13. ACM, 2010.
[56] Robert G Wedig and Marc A Rose. The reduction of branch instruction execution
overhead using structured control flow. In ACM SIGARCH Computer
Architecture News, volume 12, pages 119–125. ACM, 1984.
[57] Sam Likun Xi, Hans Jacobson, Pradip Bose, Gu-Yeon Wei, and David Brooks.
Quantifying sources of error in mcpat and potential impacts on architectural
studies. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st
International Symposium on, pages 577–589. IEEE, 2015.
[58] Matt T Yourst. PTLSim: A cycle accurate full system x86-64 microarchitectural
simulator. In Performance Analysis of Systems & Software, 2007. ISPASS 2007.
IEEE International Symposium on, pages 23–34. IEEE, 2007.
[59] Ahmad Zmily and Christos Kozyrakis. Simultaneously improving code size,
performance, and energy in embedded processors. In Proceedings of the conference
on Design, automation and test in Europe: Proceedings, pages 224–229. Euro-
pean Design and Automation Association, 2006.