The Limits of Semiconductor Technology & Coming Challenges in Microarchitecture and Architecture


Mile Stojčev, Teufik Tokić, Ivan Milentijević

Faculty of Electronic Engineering, Niš

Outline

• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education

Outline – Technology Trends

• Moore's Law 1
• Moore's Law 2
• Performance and New Technology Generation
• Technology Trends – Example
• Trends in Future
• Processor Technology
• Memory Technology

Moore's Law 1

In 1965 Gordon Moore, director of research and development at Fairchild Semiconductor and later a founder of Intel Corp., wrote a paper for Electronics entitled "Cramming more components onto integrated circuits". In the paper Moore observed that "The complexity for minimum component cost has increased at a rate of roughly a factor of two per year".

This observation became known as Moore's law.

In fact, by 1975 the leading chips had maybe one-tenth as many components as Moore had predicted. The doubling period had stretched out to an average of 17 months in the decade ending in 1975, then slowed to 22 months through 1985 and 32 months through 1995. It has revived to a now relatively peppy 22 to 24 months in recent years.
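To compare the doubling periods quoted above, it can help to convert each one into a growth factor per year and per decade. The sketch below assumes pure exponential growth; the function names are illustrative and not part of the original slides.

```python
def annual_growth(doubling_months: float) -> float:
    """Growth factor per year implied by a doubling period given in months."""
    return 2.0 ** (12.0 / doubling_months)


def growth_over(years: float, doubling_months: float) -> float:
    """Total growth factor accumulated over a span of years."""
    return 2.0 ** (12.0 * years / doubling_months)


if __name__ == "__main__":
    # Doubling periods mentioned in the text: 12 (Moore's 1965 observation),
    # then 17, 22 and 32 months for the following decades.
    for months in (12, 17, 22, 32):
        print(f"doubling every {months} months: "
              f"x{annual_growth(months):.2f} per year, "
              f"x{growth_over(10, months):.0f} per decade")
```

A doubling period of 12 months compounds to roughly a thousandfold increase per decade, which is why even a modest stretch of the doubling period changes the ten-year outcome so strongly.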

Moore's Law 1 – continued

Similar exponential growth rates have occurred for other aspects of computer technology: disk capacities, memory chip capacities, and processor performance. These remarkable growth rates have been the major driving forces of the computer revolution.

         Capacity          Speed (latency)
Logic    2x in 3 years     2x in 3 years
DRAM     4x in 3 years     2x in 10 years
Disk     4x in 3 years     2x in 10 years

Moore's Law 1 – number of transistors

Moore's Law 1 – Linewidths

One of the key drivers behind the industry's ability to double transistor counts every 18 to 24 months is the continuous reduction in linewidths. Shrinking linewidths not only enable more components to fit onto an IC (typically 2x per linewidth generation) but also lower costs (typically 30% per linewidth generation).

Moore's Law 1 – Die size

Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38x to 1.58x per year for transistor counts. Since the mid-nineties, accelerating linewidth shrinks have halted and even reversed the growth in die sizes.

Moore's Law in Action

The number of transistors on a chip doubles annually.

Moore's Law 1 – Microprocessor

Moore's Law 1 – Capacity of Single-Chip DRAM

Improving frequency via pipelining

Process technology and microarchitecture innovations together enable the frequency to double every process generation.

The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases.

Process Complexity

Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used.

Moore's Law 2 (Rock's Law)

In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2.

Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially.

The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now.

Moore's Law 2 (Rock's Law) – continued

For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996, a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors.

Moore's Law 2 (Rock's Law): The Cost of Semiconductor Tools Doubles Every Four Years

Machrone's Law

The PC you want to buy will always be $5000.

Metcalfe's Law

A network's value grows proportionately to the number of its users squared.

Wirth's Law

Software is slowing faster than hardware is accelerating.

Performance and new technology generation

According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.

The increase in components per chip comes from the following key factors:

• A factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y).
• An additional factor of 2^0.5 comes from an increase in chip area.
• A final factor of 2^0.5 comes from device and circuit cleverness.

Together these factors give 2 x 2^0.5 x 2^0.5 = 4 times as many components per chip, which matches the quadrupling of memory capacity.

Development in ICs

Semiconductor Industry Association Roadmap Summary for high-end Processors

Specification / year          1997     1999     2001     2003     2006     2009     2012
Feature size (micron)         0.25     0.18     0.15     0.13     0.1      0.07     0.05
Supply voltage (V)            1.8-2.5  1.5-1.8  1.2-1.5  1.2-1.5  0.9-1.2  0.6-0.9  0.5-0.6
Transistors/chip (millions)   11       21       40       76       200      520      1400
DRAM bits/chip (mega)         167      1070     1700     4290     17200    68700    275000
Die size (mm2)                300      340      385      430      520      620      750
Global clock freq (MHz)       750      1200     1400     1600     2000     2500     3000
Local clock freq (MHz)        750      1250     1500     2100     3500     6000     10000
Maximum power/chip (W)        70       90       110      130      160      170      175

[Figure: Total transistors per chip. Number of transistors (millions, 0 to 1600) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012]

Clock Frequency Versus Year for Various Representative Machines

Limits in Clocking

Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.

For higher frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.

[Figure: Global clock frequency (MHz) and local clock frequency (MHz), 0 to 12000 MHz, versus year, 1997 to 2012]

Intel's Microprocessors Clock Frequency

Technology Trends – Example

As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.

Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?

Solution

In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.

In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min, and to make the one-way trip on less than 10 liters of gasoline.
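The compound-growth arithmetic of this example can be checked with a few lines of code. This is a minimal sketch; the helper name `compound` is an assumption for illustration. Because it uses unrounded growth factors, its results differ slightly from the slide's figures, which multiply the rounded factors 20.1 and 194.6.

```python
def compound(value: float, rate: float, years: int) -> float:
    """Value after compounding an annual improvement rate for a number of years."""
    return value * (1.0 + rate) ** years


# Starting point from the example: an average car in 1977.
speed_1977 = 150.0    # km/h
economy_1977 = 10.0   # km/l

# 35% per year for 10 years (1977 to 1987).
speed_1987 = compound(speed_1977, 0.35, 10)
economy_1987 = compound(economy_1977, 0.35, 10)

# 50% per year for 13 more years (1987 to 2000).
speed_2000 = compound(speed_1987, 0.50, 13)
economy_2000 = compound(economy_1987, 0.50, 13)

if __name__ == "__main__":
    print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
    print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
```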

Feature size versus time in silicon ICs

The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.

The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.

Trends in feature size over time

Processor technology today

The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.

Ideally, process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.

With such scaling, typical improvement figures are the following:
• 1.4 – 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power

Processor Technology and Microprocessors

Process technology is the most important technology that drives the microprocessor industry.

It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.

Microarchitecture attempts to increase both IPC and frequency.

Process technology and microarchitecture

Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).

Pipelining, as a microarchitecture idea, helps to increase frequency.

A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.

Frequency and performance improvements

While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.

With frequencies higher than 1 GHz, more than 20 pipeline stages are used.

[Figure: Relative improvement in frequency, CPI, performance, and power versus pipeline depth (0 to 20 stages)]

Performance of memory and CPU

Memory in a computer system is hierarchically organized.

In 1980 microprocessors were often designed without caches.

Nowadays microprocessors often come with two levels of caches.

Memory Hierarchy

Processor-DRAM Gap

Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.

Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.

Relative processor/memory speed

Types of Memories

MOS memories
– RAMs (volatile: contents lost at power off): DRAM, SRAM
– ROMs (non-volatile: contents kept at power off): ROM, EPROM, EEPROM, FLASH

Percentage of Usage

Typical Applications of DRAM

An anecdote

In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired.

In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.

Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.

An anecdote – continued

If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.

All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.

It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
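The pin-bandwidth estimate above is a simple product of scaling factors. The sketch below reproduces it; the function name is illustrative, and 1 GByte is taken as 10^9 bytes, so bytes per nanosecond equal GBytes per second.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Pin bandwidth in GBytes/s (1 GByte = 10**9 bytes): bytes/ns equals GBytes/s."""
    return bytes_per_transfer / period_ns


# Today's chip in the anecdote: 16 bytes every 8 ns.
today = bandwidth_gbytes_per_s(16, 8)

# Future chip: twice the clock rate, twice the issue width, two instruction streams.
clock_factor, issue_factor, stream_factor = 2, 2, 2
future = today * clock_factor * issue_factor * stream_factor

if __name__ == "__main__":
    print(f"today:  {today:.0f} GBytes/s")
    print(f"future: {future:.0f} GBytes/s "
          f"(same as 32 bytes every 2 ns = {bandwidth_gbytes_per_s(32, 2):.0f} GBytes/s)")
```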

Memory system

In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance.

To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.

Expensive Memory Called a Cache

A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.

A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.

Two Levels of Caches

Most advanced microprocessors today employ two levels of caches on chip.

The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.

The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.

Memory Hierarchy Impact on Performance

An off-chip memory access may take about 100 cycles.

A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.

Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
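The effect of a two-level cache hierarchy can be estimated with the standard average memory access time (AMAT) formula, which the slides do not state explicitly and is brought in here as an assumption. The sketch below plugs in the upper ends of the quoted ranges (3-cycle first level catching 95% of accesses, 10-cycle second level catching 50% of first-level misses, 100-cycle main memory); the function and parameter names are illustrative.

```python
def amat(l1_cycles: float, l1_hit_rate: float,
         l2_cycles: float, l2_hit_rate: float,
         memory_cycles: float) -> float:
    """Average memory access time in cycles for a two-level cache hierarchy.

    Every access pays the L1 latency; L1 misses additionally pay the L2
    latency; L2 misses additionally pay the main-memory latency.
    """
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * memory_cycles)


# Figures taken from the text; the choice of the upper end of each range
# is an assumption made for this illustration.
with_caches = amat(l1_cycles=3, l1_hit_rate=0.95,
                   l2_cycles=10, l2_hit_rate=0.50,
                   memory_cycles=100)
without_caches = 100.0  # every access goes to main memory

if __name__ == "__main__":
    print(f"average access time with two cache levels: {with_caches:.1f} cycles")
    print(f"average access time without caches:        {without_caches:.1f} cycles")
```

Under these assumptions the hierarchy brings the average access close to the first-level latency, even though a single miss to main memory still costs about 100 cycles.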

Conclusion concerning memory – problems

Today's chips are largely able to execute code faster than we can feed them with instructions and data.

There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.

The real design action is in memory subsystems: caches, busses, bandwidth, and latency.

Conclusion concerning memory – problems, continued

If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.

One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.

Memory Hierarchy Solutions

Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.

System-level parameters most affect performance:

a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.

b) Burst width, which refers to data access granularity, can effect a 15% performance change.

c) Magnetic RAM (MRAM), a new type of memory.

Magnetic RAM – MRAM or NanotechRAM (NRAM)

Based on nanoscale semiconductor technology.

A nanotechnology RAM device consists of tiny carbon nanotubes.

Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.

MRAM Capacity

MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).

The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter on a silicon wafer.

MRAM as Universal Memory

MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon

Capacity of DRAM and FLASH - MRAM

[Figure: Design rule [nm] (100 nm, 90 nm, 70 nm, 55 nm) and density [Gb] (log scale, 0.1 to 100) of FLASH and DRAM for 2003, 2005, and 2010, with data labels up to 16 Gb for SLC and MLC (2 bits/cell and 3 bits/cell) devices]

Surpassing the Prediction from Moore's Law – DRAM vs MRAM

The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.

[Figure: Density [Gb] (log scale, 0.1 to 100) versus year (2000 to 2010) for DRAM and FLASH, including MLC (2 bits/cell) and MLC (3 bits/cell), compared with the Moore's Law trend; FLASH follows a 2-fold density increase per year]

Overall memory prediction roadmap

Even though the density growth of DRAM will slow down, DRAM will still keep leading the overall memory technology and will be able to reach 8 Gb density in ten years.

High-density memory growth will surpass the prediction from Moore's Law.

Overall memory prediction roadmap – continued

1988 Computer Food Chain

[Figure: computer food chain, with labels including Mainframe and Supercomputer]

Outline - Low Power Design

• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point

Power consumption

• During 1995, the energy consumption of all PC machines installed in the USA was 60·10^6 MWh.

• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production.

• For 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69·10^6 MWh.

Typical Low-Power Applications

• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls

"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)

"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."

Power dissipation in time

[Figure: Power (W, log scale from 0.01 to 100) versus year (1980 to 1995); power dissipation grows x4 every 3 years]

Gloom and Doom predictions

Power density will increase

VDD, Power and Current Trend

[Figure: Voltage [V] (0 to 2.5), power per chip [W] (0 to 200), and VDD current [A] (0 to 500) versus year, 1998 to 2014]

International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)

(Taken from Sakurai's ISSCC 2001 presentation)

Power Delivery Problem (not just California)

[Figure label: your car starter]

Power Consumption: New Dimension in Design

Sources of Power Consumption

The three major sources of power consumption in digital CMOS circuits are summarized by

$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where

P1 – capacitive switching power (dynamic, dominant)

P2 – short-circuit power (dynamic)

P3 – leakage current power (static)

P4 – static power dissipation (minor)
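The contributions P1 to P4 can be evaluated numerically to see which term dominates. The following is a minimal sketch of that sum; the function name, the parameter names, and all numeric values in the example are hypothetical and chosen only for illustration.

```python
def average_power(p_t: float, c_load: float, v_dd: float, f_clk: float,
                  i_sc: float, i_leak: float, p_static: float = 0.0) -> dict:
    """Break the average power of a CMOS circuit into its four contributions.

    p_t      switching activity factor (dimensionless)
    c_load   load capacitance C_L in farads
    v_dd     supply voltage in volts
    f_clk    clock frequency in hertz
    i_sc     short-circuit current in amperes
    i_leak   leakage current in amperes
    p_static remaining static power P4 in watts
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p_static,
            "total": p1 + p2 + p3 + p_static}


if __name__ == "__main__":
    # Hypothetical values: 1 nF switched load, 2 V supply, 100 MHz clock.
    result = average_power(p_t=0.1, c_load=1e-9, v_dd=2.0, f_clk=100e6,
                           i_sc=1e-3, i_leak=1e-4)
    for name, watts in result.items():
        print(f"{name}: {watts * 1e3:.2f} mW")
```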

Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution

Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops

Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling

Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout

Reducing the Power Dissipation

The power dissipation can be minimized by reducing:

• supply voltage
• load capacitance
• switching activity

– Reducing the supply voltage brings a quadratic improvement.

– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.

Amount of Reducing the Power Dissipation

Gate Delay and Power Dissipation in Terms of Supply Voltage

[Figure: Normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V]]

Needs for Low-Power

• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.

• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.

Low-Power Design Techniques

The basic idea is to decrease the activity of some parts within a VLSI IC.

The term power management refers to such techniques in general.

Applying power management to a design typically involves two steps:

a) identifying idle or low-activity conditions for various parts of the circuit, and

b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.

a) Reduction in fCLK is an option that is acceptable when some components may be idle or low-active during operation.

b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.

c) The product ptCL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance are pursued at the system, architectural, RTL, circuit, or technology level.
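The relative effect of the three options a) to c) follows directly from the switching-power expression pt·CL·Vdd^2·fCLK. The sketch below compares them against a baseline of 1.0; the function name and the example scaling factors are assumptions for illustration.

```python
def relative_switching_power(v_scale: float = 1.0,
                             f_scale: float = 1.0,
                             c_scale: float = 1.0) -> float:
    """Switching power relative to a baseline design.

    v_scale scales Vdd, f_scale scales fCLK, and c_scale scales the average
    switched capacitance per cycle (the product pt*CL).
    """
    return c_scale * v_scale ** 2 * f_scale


if __name__ == "__main__":
    print("half the clock frequency:   ", relative_switching_power(f_scale=0.5))
    print("half the supply voltage:    ", relative_switching_power(v_scale=0.5))
    print("half the switched capacit.: ", relative_switching_power(c_scale=0.5))
    print("half voltage and frequency: ",
          relative_switching_power(v_scale=0.5, f_scale=0.5))
```

Halving Vdd alone cuts switching power to one quarter, while halving fCLK or the switched capacitance alone only halves it; the sketch does not model the increase in circuit delay that accompanies a lower Vdd.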

General Approaches to Reduce Power

Low Power and Low Energy System Design

Levels are listed from higher impact and more options (top) to lower (bottom):

System Level           Design partitioning, Power down
Algorithm Level        Complexity, Concurrency, Locality, Regularity, Data representation
Architecture Level     Voltage scaling, Parallelism, Instruction set, Signal correlations
Circuit Level          Transistor sizing, Logic optimization, Activity-driven power down, Low-swing logic, Adiabatic switching
Process/Device Level   Threshold reduction, Multi-threshold

The design of low-power circuits can be tackled at different levels, from system to technology.

Multiple Frequency on the Chip as Technique to Reduce Power

Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.

This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating

[Figure: A central PLL generating fCLK feeds four local clock blocks (PLL 1 / DLL 1 to PLL 4 / DLL 4), each producing its own local frequencies (f11, f12, f21, f31, f32, f41)]

Energy Minimization Using Multiple Frequency

[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N, with CLKREF and CLKFB inputs and up/down signals) driving a digital system with regulated voltage, clock distribution & frequency multiplier, and logic; and DLL-based clock generation (phase detector, current pump, loop filter, VCDL with control input and delay cells DC1 to DCn) producing f1, 2f1, ..., nf1 for a digital system]

Clock Gating & Clock Distribution as Techniques to Reduce Power

[Figure: Clock gating: a latch samples enable and, combined with clock, produces gated-clock for the target flip-flops (DFF with D, C, Q), which are activated or deactivated. Clock distribution: a PLL (Clk generator) drives Clk to blocks A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C]

The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
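The behaviour of a gated clock can be illustrated with a small cycle-level simulation. This is a hedged sketch of one common latch-based scheme (enable sampled while the clock is low, then ANDed with the clock), not necessarily the exact circuit in the figure; all names are illustrative.

```python
def gated_clock(clock: list, enable: list) -> list:
    """Simulate latch-based clock gating on sampled 0/1 waveforms.

    The enable signal is captured by a latch that is transparent while the
    clock is low, so the gated clock never changes in the middle of a high
    clock phase. The output is clock AND latched enable.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:          # latch is transparent during the low phase
            latched = en
        out.append(clk & latched)
    return out


def rising_edges(waveform: list) -> int:
    """Count 0 -> 1 transitions, a rough proxy for clock switching activity."""
    return sum(1 for a, b in zip(waveform, waveform[1:]) if a == 0 and b == 1)


clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 0, 1, 1, 0, 0]
gated = gated_clock(clock, enable)

if __name__ == "__main__":
    print("clock:      ", clock, "edges:", rising_edges(clock))
    print("gated clock:", gated, "edges:", rising_edges(gated))
```

In this example the module behind the gate sees half as many clock edges as the free-running clock delivers, which is the source of the energy saving.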

Energy Minimization Using Multiple Supply Voltage

• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.

• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints), while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).

• This scheme tends to result in a smaller area overhead compared to parallel architectures.

System Level Dynamic Power Management as Another Technique to Reduce Power

Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.

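[Figure: Power manager: an observer collects workload information and observations from the system, and a controller issues commands to the system. Power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event), with transition times of ~10 µs, ~90 µs, and 160 ms]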
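The benefit of a power manager can be estimated by summing power times duration over the states of the power state machine. The sketch below uses the state powers shown in the figure (400 mW, 50 mW, 0.16 mW); the schedules are hypothetical, and transition times and transition energy are deliberately ignored, which is a simplifying assumption.

```python
# State powers in milliwatts, as given for the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(schedule: list) -> float:
    """Energy in millijoules for a schedule of (state, seconds) pairs.

    Transition delays and transition energy are not modelled.
    """
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


# Hypothetical 10-second workload: the system is only busy for 2 seconds.
always_run = [("RUN", 10.0)]
managed = [("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)]

if __name__ == "__main__":
    print(f"always running: {energy_mj(always_run):.1f} mJ")
    print(f"power managed:  {energy_mj(managed):.1f} mJ")
```

A real power manager also has to weigh the wake-up latency of each state against the expected idle time, which this sketch leaves out.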

Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics

[Figure: Two pie charts. Power breakdown: Clock, Memory, Control, I/O, Datapath. Dynamic instruction statistics: Compare op, Logical op, Others, Data Move, Control Flow, Arithmetic op]

Architecture Trade-offs – Reference Datapath

Parallel Datapath

The More Parallel the Better

Pipeline Datapath

Architecture Summary for a Simple

Outline – Microprocessors' Generations

• First Generation 1971-78
– Behind the power curve
• Second Generation 1979-85
– Becoming "real" computers
• Third Generation 1985-89
– Challenging the "establishment"
• Fourth Generation 1990-
– Architectural and performance leadership

The microprocessor today

When we say "microprocessor" today, we generally mean the shaded area of the figure.

The First Generation 1971-78

• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502

Intel 4004

• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe

Intel 8080

• Intel's first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case; 256 bytes of memory, expandable to 64K; keyboard and floppy; 100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one (Homebrew Computer Club)

Intel 8086

• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces PC
– Based on 8088, the 8-bit bus version of 8086

Second Generation 1979-85

• Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386

Motorola 68000

• Major architectural step in microprocessors
– First 32-bit architecture; initial 16-bit implementation
– First flat 32-bit address; support for paging
– General-purpose register architecture; loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations

Third Generation 1985-89

• Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC

MIPS R2000

• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS

Fourth Generation 1990-

• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium

Key Architectural Trends

• Increase performance at 1.6x per year
– True from 1985 to the present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors
• Architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism

Memory Hierarchies

• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching; first multilevel cache: 1986; secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies: early restart after cache miss (1992); nonblocking caches, which continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers; prefetching: instruction to bring data into cache early

Exploiting ILP

ILP is the implicit parallelism among instructions

Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
   Superscalar: uses dynamic issue decision (HW driven)
   VLIW: uses static issue decision (SW driven)

1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– Determine parallelism dynamically
– Execute instructions out of order
– Speculative execution depending on branch prediction

“Off-the-shelf” ILP techniques yielded a 20-year path
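To make "implicit parallelism among instructions" concrete, the following sketch estimates the ILP available in a short instruction sequence from its data dependences alone. It is a simplified model under stated assumptions: unit latency for every instruction, unlimited issue width, and only true (read-after-write) dependences, as if register renaming had removed the others. The `Instr` type and the example program are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def schedule_depths(program: List[Instr]) -> List[int]:
    """Earliest cycle (1-based) in which each instruction can execute."""
    produced_in = {}  # register -> cycle in which its latest value is produced
    depths = []
    for ins in program:
        # An instruction can start one cycle after its latest producer.
        start = 1 + max((produced_in.get(r, 0) for r in ins.srcs), default=0)
        produced_in[ins.dest] = start
        depths.append(start)
    return depths


def ideal_ilp(program: List[Instr]) -> float:
    """Instructions divided by the length of the dependence critical path."""
    return len(program) / max(schedule_depths(program))


if __name__ == "__main__":
    program = [
        Instr("r1", ()),            # load a
        Instr("r2", ()),            # load b
        Instr("r3", ("r1", "r2")),  # r3 = r1 + r2
        Instr("r4", ()),            # load c
        Instr("r5", ("r3", "r4")),  # r5 = r3 * r4
        Instr("r6", ("r1", "r4")),  # r6 = r1 - r4
    ]
    print(schedule_depths(program))
    print(ideal_ilp(program))
```

Here six instructions fit in a three-cycle critical path, so at most two instructions per cycle can be sustained on average. A superscalar machine discovers such a schedule in hardware, while a VLIW compiler computes it statically.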

MIPS R4000

First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS

Intel i860

First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as an attached processor (e.g., graphics)

MIPS R10000

First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
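The combination of out-of-order execution with in-order completion can be sketched with a generic reorder buffer. This is an illustrative model, not a description of the R10000's actual hardware structures: the class and method names are hypothetical, and only the window size of 32 and the limit of 4 completions per clock are taken from the slide.

```python
from collections import deque


class ReorderBuffer:
    """Tracks in-flight instructions in program order."""

    def __init__(self, size: int = 32, graduate_width: int = 4):
        self.size = size                      # max instructions in flight
        self.graduate_width = graduate_width  # max completions per clock
        self.entries = deque()                # oldest instruction at the left
        self.executed = set()                 # tags that have finished executing

    def dispatch(self, tag) -> bool:
        """Add an instruction in program order; fail if the window is full."""
        if len(self.entries) >= self.size:
            return False
        self.entries.append(tag)
        return True

    def mark_executed(self, tag) -> None:
        """Record that an instruction finished executing (in any order)."""
        self.executed.add(tag)

    def graduate(self) -> list:
        """Retire finished instructions from the head, oldest first."""
        retired = []
        while (self.entries
               and len(retired) < self.graduate_width
               and self.entries[0] in self.executed):
            tag = self.entries.popleft()
            self.executed.discard(tag)
            retired.append(tag)
        return retired


if __name__ == "__main__":
    rob = ReorderBuffer()
    for tag in range(5):
        rob.dispatch(tag)
    rob.mark_executed(2)   # younger instructions finish first
    rob.mark_executed(1)
    print(rob.graduate())  # nothing retires: instruction 0 is still pending
    rob.mark_executed(0)
    print(rob.graduate())  # instructions 0, 1, 2 retire together, in order
```

Because architectural state is updated only in `graduate`, an exception or a mispredicted branch at the head of the buffer leaves no younger instruction's result committed, which is what "precise state" requires.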

Intel IA-64 and Itanium

EPIC architecture
– Uses a compiler-centric approach while avoiding the disadvantages
– Parallelism demarcated by the compiler
– Many special instructions and features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 watts

Breakdown of tasks between compiler and runtime hardware

Today’s Uniprocessor ILP Menu

Wide variety of approaches, both hardware and compiler intensive

Software Techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence

Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Trade-offs: more stable performance, higher complexity, potential clock rate impact

No clear-cut winners at present

Big Picture: ILP and Memory Systems

ILP Mountain (stages): simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation

Cache Mountain (stages): simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching

My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit

[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors]

Microprocessors today: where they are and what they can do

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, with eras labeled bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?); data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]

Microprocessors: where they go

Intel: more transistors

Intel: faster devices

Number of transistors in Intel’s processors

Higher-level parallelism

Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.

The most prominent are:

a) simultaneous multithreaded (SMT) processors, and

b) chip multiprocessors (CMP)

Multithreading

A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle

It is hard to achieve this level of parallelism from a single program

Can we run multiple programs (threads) on a single processor without much effort?

Simultaneous multithreading (SMT), or Hyper-Threading, is a solution

Parallel Thread Sequencing Model

Principles of SMT

Multithreading in today’s processors

Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)

They support 2-4 threads, but expect only a 1.3x improvement in throughput

Chip Multiprocessor

Several processor cores on one die

Shared L2 caches

Chip communication to build a multichip module with many CMPs + memory

Chip multiprocessor (CMP) platform model

CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner

The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)

The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system

Chip multiprocessor (CMP) platform model (continued)

CMP is an attractive option to use when moving to a new process technology such as SoC

Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.

MPSoCs are usually implemented as heterogeneous systems

CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors

Generic circa 2010 Microprocessor

4-8 general-purpose processing engines on chip, used to execute
– Independent programs
– Explicitly parallel programs (when possible)
– Speculatively parallel threads

Special-purpose processing units (e.g., DSP functionality)

Elaborate memory hierarchy

Elaborate inter-chip communication facilities

Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
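One way to read the table above is that the three designs are given roughly equal total resources, split differently. The short sketch below multiplies the per-CPU figures by the number of CPUs to show this. The `Design` class and its field names are hypothetical, and the numbers are the table entries with thousands separators restored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    cpus: int
    issue_width_per_cpu: int
    window_per_cpu: int             # instruction window entries
    predictor_entries_per_cpu: int  # branch predictor table entries
    return_stack_per_cpu: int       # return stack entries
    l1_kbytes_per_cpu: int          # size of each of the I and D caches

    def totals(self) -> dict:
        """Chip-wide totals: per-CPU resources times the number of CPUs."""
        return {
            "issue width": self.cpus * self.issue_width_per_cpu,
            "window": self.cpus * self.window_per_cpu,
            "predictor entries": self.cpus * self.predictor_entries_per_cpu,
            "return stack": self.cpus * self.return_stack_per_cpu,
            "L1 Kbytes": self.cpus * self.l1_kbytes_per_cpu,
        }


SUPERSCALAR = Design("superscalar", 1, 12, 256, 32768, 64, 128)
SMT = Design("simultaneous multithreading", 1, 12, 256, 32768, 64, 128)
CMP = Design("chip multiprocessor", 8, 2, 32, 4096, 8, 16)

if __name__ == "__main__":
    for design in (SUPERSCALAR, SMT, CMP):
        print(design.name, design.totals())
```

The chip multiprocessor's window, predictor, return stack, and primary cache totals equal those of the single wide core, while its aggregate issue width (16) is somewhat higher than 12. The comparison is therefore about how the resources are partitioned, not how many there are.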

The microprocessor tomorrow

When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure

Outline

Outline – Challenges in Education

• Changes in curricula
• Fundamentals
• A sort of challenge we should accept

Challenges in Education

It has often been said that “where you stand depends on where you sit.”

In this context, starting from our positions and experiences, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?

How to organize the training of new engineers

The engineers we are training today will still be practicing 40 years from now

Are we preparing them for what they will be doing then?

Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today’s graduates for the next 40 years?

We think not, on both counts

Our view & our experience

Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up

Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields

Changes in curricula

It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications

But the fact is that the practice of engineering is changing at about the same pace as the technology it creates

What are fundamentals?

The undergraduate curriculum should teach (only) fundamentals

Everyone agrees with that

But what are fundamentals?

Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics

But as we said earlier, engineering is changing

What kinds of fundamentals we need now – some examples

Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental

Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering

Kinds of Fundamentals

Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental

Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals

How to add these new fundamentals

The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full

We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly

What will the character and essence of electrical and computer engineering education look like in the future?

It is difficult to predict the future with any accuracy, but it is safe to say that

– Web-based teaching
– distance learning
– electronic books, and
– interactive learning environments

will play increasingly significant roles in shaping what we teach, how we teach, and how students learn

A sort of challenge we should accept

During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time

That is the sort of challenge we should accept for improving engineering education

Outline

[Figure: number of transistors per chip (millions, 0 to 1600) versus process generation (0.25, 0.18, 0.15, 0.13, 0.1, 0.07, 0.05 µm)]
[Figure: frequency (MHz, 0 to 12000) versus year (1997, 1999, 2001, 2003, 2006, 2009, 2012)]
[Figure: relative improvement versus pipeline depth; curves: frequency, CPI, performance, power]
[Diagram: MOS memories. RAMs (volatile): DRAM, SRAM. ROMs (non-volatile): ROM, EPROM, EEPROM, FLASH]
[Figure: design rule [nm] and density [Gb] versus year (2003, 2005, 2010) for FLASH and DRAM; design rules 100 nm, 90 nm, 70 nm, 55 nm; SLC, MLC (2 bits/cell), and MLC (3 bits/cell) densities from 1 Gb to 16 Gb]
[Figure: density [Gb] (log scale, 0.1 to 100) versus year (2000, 2005, 2010); curves: DRAM, Moore's Law, MLC (2 bits/cell), MLC (3 bits/cell)]
[Figure: power (W, log scale, 0.01 to 100) versus year (1980 to 1995); annotation: x4 / 3 years]
[Figure: voltage [V], power per chip [W], and VDD current [A] versus year (1998 to 2014); curves: current, power, voltage; annotation: "Your car starter"]
[Equation: switching power, $P_{sw} = p_t \, C_L \, V_{dd}^2 \, f_{CLK}$]
[Figure: normalized gate delay [ns] and normalized power dissipation [W] (axis ticks 0.6, 3.0, 5.0)]
[Diagram: low-power design levels: algorithm level, architecture level, circuit level, process/device level]
[Diagram: clock distribution with a central PLL (fCLK) feeding four local PLL/DLL pairs (PLL 1-4, DLL 1-4) with local frequencies f11 to f41]
[Diagram: PLL-based clock generator (phase detector, current pump, loop filter, VCO, divider by N, CLKREF, CLKFB, regulated voltage, digital system) and DLL-based clock generator (phase detector, current pump, loop filter, VCDL, control logic, outputs f1, 2f1, ..., nf1)]
[Diagram: clock gating: enable and clock combined, directly or through a latch, into a gated clock for the target flip-flops (DFF)]
[Diagram: PLL clock generator distributing Clk to blocks A, B, C, D, gated by Enable_A, Enable_B, Enable_C]
[Diagram: power states RUN (P = 400 mW), IDLE (P = 50 mW), SLEEP (P = 0.16 mW) with labeled transition times; power manager consisting of an observer and a controller, driven by workload information from the system]
[Charts: power breakdown by component (clock, memory, control, I/O, datapath) and instruction mix (arithmetic, data move, control flow, compare, logical, others)]
OutlineOutline








































































0

200

400

600

800

1000

1200

1400

1600

025 018 015 013 01 007 005


No

of

tra

ns

isto

rs (

mill

ion

s) Transistorschip




0

2000

4000

6000

8000

10000

12000

1997 1999 2001 2003 2006 2009 2012


Fre

qu

ency

(M

Hz)



























0

2

4

6

8

10

12

14

16

18

20

Pipeline Depth

Rela

tive I

mp

rovem

en

t

Frequency

CPI

Performance

Power










[Figure: Classification of MOS memories. RAMs (volatile): DRAM, SRAM. ROMs (non-volatile): ROM, FLASH, EEPROM, EPROM.]

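[Figure: FLASH and DRAM scaling for 2003, 2005 and 2010. Axes: design rule [nm] (0 to 120) and density [Gb] (log scale, 0.1 to 100). Design-rule labels: 100nm, 90nm, 70nm, 55nm. Density labels: 0512Mb, 1Gb, 2Gb, 4Gb, 8Gb, 16Gb. Cell types: SLC, MLC (2 bits/cell), MLC (3 bits/cell).]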



[Figure: Density [Gb] (log scale, 0.1 to 100) versus year (2000, 2005, 2010). Legend: DRAM, Moore's Law, MLC (3 bits/cell), MLC (2 bits/cell). Point labels: 0512Mb, 1Gb, 2Gb, 4Gb, 12Gb.]

Mainframe

Outline

[Figure: Power (W, log scale, 0.01 to 100) versus year ('80, '85, '90, '95). Annotation: x4 / 3 years.]

[Figure: Voltage [V] (0 to 2.5), power per chip [W] and VDD current [A] (secondary axes 0 to 200 and 0 to 500) versus year (1998, 2002, 2006, 2010, 2014). Curves: Current, Power, Voltage. Annotation: "Your car starter".]





where





+ P4


$P_{sw} = p_t \, C_L \, V_{dd}^2 \, f_{CLK}$
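The switching-power relation on this slide, read as P_sw = p_t * C_L * V_dd^2 * f_CLK, can be evaluated directly. In the sketch below p_t is taken to be the switching activity, C_L the load capacitance, V_dd the supply voltage and f_CLK the clock frequency. The slide's own definitions did not survive extraction, so these readings, the function name and the sample values are assumptions.

```python
def switching_power(p_t, c_load_f, v_dd, f_clk_hz):
    """Switching power in watts: p_t * C_L * V_dd**2 * f_CLK.

    p_t       -- switching activity, 0..1 (assumed meaning)
    c_load_f  -- load capacitance in farads
    v_dd      -- supply voltage in volts
    f_clk_hz  -- clock frequency in hertz
    """
    return p_t * c_load_f * v_dd ** 2 * f_clk_hz


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the slides.
    full = switching_power(0.5, 1e-9, 3.3, 100e6)
    half = switching_power(0.5, 1e-9, 1.65, 100e6)
    print(f"V_dd = 3.3 V:  {full:.4f} W")
    print(f"V_dd = 1.65 V: {half:.4f} W")
    print(f"ratio: {full / half:.1f}")
```

The quadratic term is why lowering the supply voltage is the most effective single lever: halving V_dd cuts switching power by a factor of four, whereas halving the frequency or the activity only halves it.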












[Figure: Gate delay [ns] (normalized) and power dissipation [W] (normalized) plotted against a horizontal axis with ticks 0.6, 3.0, 5.0.]



Needs for Low-Power


The basic idea is
















[Figure: Design levels: Algorithm Level, Architecture Level, Circuit Level, Process / Device Level.]










[Figure: Clock distribution. Labels: PLL, fCLK; PLL 1 / DLL 1 (f11, f12); PLL 2 / DLL 2 (f21, f21); PLL 3 / DLL 3 (f31, f32); PLL 4 / DLL 4 (f41, f41).]


[Figure: PLL-based clock generator. Blocks: phase detector, current pump, loop filter, VCO, divider by N, digital system, logic. Signals: CLKREF, CLKFB, up, down, regulated voltage.]

phasedetector

curentpump

up

down

CLKREF

CLKFBloopfilter

digital system

f1 2f1 nf1

control

VCDL

TVCDL

DC1 DC2 DCn

in out

PLL based

DLL based


DFF

D

C

Q

enable

clock

gated-clock

target flip-flops

latch

clock

enable

gated-clock


D C

BA

Enable_BEnable_A

Enable_C

PLL(Clk generator)

Clk









RUN

SLEEPIDLE~90s

~10s 160ms


P=400mW

~90s~10s

P=50mW P=016mW

OBSERVER CONTROLLER

Workloadinformation

Power Manager

SYSTEM




12

16

25

43

Clock

Memory

Control IODatapath

1513

51

23

43

Compare op

Logical op

Others

Data Move

Control Flow

Arithmetic op







OutlineOutline



































version of 8086













workstations



































































[Figure: "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "Cache Mountain" (simple caches, compiler prefetching, multipath prefetching); axis label: Higher complexity; annotation: "My view"]

[Figure: Performance (logarithmic scale, 0.1 to 100) versus year (1965 to 1995) for Supercomputers, Mainframes, Minicomputers and Microprocessors]


[Figure: Transistors per chip (logarithmic scale, 1,000 to 100,000,000) versus year (1970 to 2005); points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]






















Shared L2 caches


















Chip multiprocessor






















































[Figure: Number of transistors per chip (millions, 0 to 1600) versus technology generation (ticks 0.25, 0.18, 0.15, 0.13, 0.1, 0.07, 0.05; unit not preserved in extraction)]




[Figure: Frequency (MHz, 0 to 12000) versus year (1997, 1999, 2001, 2003, 2006, 2009, 2012)]



























[Figure: Relative improvement versus pipeline depth (0 to 20); curves: Frequency, CPI, Performance, Power]










[Figure: Classification of MOS memories. RAMs (volatile): DRAM, SRAM. ROMs (non-volatile): ROM, EPROM, EEPROM, FLASH]

















































[Figure: Flash design rule [nm] versus year (2003, 2005, 2010), scale 0 to 120 nm; nodes 100nm, 90nm, 70nm, 55nm; SLC, MLC (2 bits/cell) and MLC (3 bits/cell) parts labelled from 1Gb to 16Gb]
Den

sity

[G

b]

FLASH

DRAM



100

4Gb

2Gb

1Gb

0512Mb

12Gb

4Gb

2Gb

1Gb

01

1

10

100

2000 2005 2010

De

ns

ity

[G

b]


DRAM

Moores Law

MLC (3bitscell)

MLC (2bitscell)








M a infra m e


OutlineOutline












95908580001

01

1

10

100P

ow

er (

W)

x4 3years




Year

Vo

ltag

e [V

]

Po

wer

per

ch

ip [

W]

VD

D c

urr

ent

[A]


1998 2002 2006 2010 20140

05

1

15

2

25

0 0

200 500

Current

Power

Voltage




Your carstarter





where





+ P4


Psw = pt CL V2

dd fCLK












06 30 50


Ga

te d

elay

[n

s]

(no

rma

lize

d)

Po

we

r d

issi

pat

ion

[ W

](n

orm

ali

zed

)

1

1 02 5

1



Needs for Low-Power


The basic idea is
















AlgorithmLevel

ArchitectureLevel

CircuitLevel

Process DeviceLevel










f11 f12

PLL 1 DLL 1

PLL

fCLK

f21 f21

PLL 2 DLL 2

f31 f32

PLL 3 DLL 3

f41 f41

PLL 4 DLL 4

f11 f12

PLL 1 DLL 1

PLL

fCLK

f21 f21

PLL 2 DLL 2

f31 f32

PLL 3 DLL 3

f41 f41

PLL 4 DLL 4


phasedetector

curentpump

divider by N

up

down

CLKREF

CLKFB

VCO

loopfilter

digital system

regulated voltage


logic

phasedetector

curentpump

up

down

CLKREF

CLKFBloopfilter

digital system

f1 2f1 nf1

control

VCDL

TVCDL

DC1 DC2 DCn

in out

PLL based

DLL based


DFF

D

C

Q

enable

clock

gated-clock

target flip-flops

latch

clock

enable

gated-clock


D C

BA

Enable_BEnable_A

Enable_C

PLL(Clk generator)

Clk









RUN

SLEEPIDLE~90s

~10s 160ms


P=400mW

~90s~10s

P=50mW P=016mW

OBSERVER CONTROLLER

Workloadinformation

Power Manager

SYSTEM




12

16

25

43

Clock

Memory

Control IODatapath

1513

51

23

43

Compare op

Logical op

Others

Data Move

Control Flow

Arithmetic op







OutlineOutline



































version of 8086













workstations



































































[Figure: "ILP Mountain" and "Cache Mountain", both with a "Higher complexity" axis. ILP Mountain labels: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain labels: simple caches, compiler prefetching, multipath prefetching. Annotation: "My view"]

[Figure: Performance (log scale 0.1-100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors]


[Figure: Transistors per chip (log scale 1,000-100,000,000) versus year, 1970-2005; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]






















Shared L2 caches


















Chip multiprocessor



Outline































































[Figure: Relative improvement versus pipeline depth (0-20); curves: Frequency, CPI, Performance, Power]










MOS memories

RAMs (volatile): DRAM, SRAM

ROMs (non-volatile): ROM, EPROM, EEPROM, Flash

















































100nm

90nm

55nm

70nm

0512Mb

2Gb

4GbMLC(2bitscell)

SLC

1Gb

4Gb

8Gb

16Gb

MLC(3bitssell)

0

20

40

60

80

100

120

2003 2005 2010

Des

ign

Ru

le [

nm

]

01

1

10

100

Den

sity

[G

b]

FLASH

DRAM



100

4Gb

2Gb

1Gb

0512Mb

12Gb

4Gb

2Gb

1Gb

01

1

10

100

2000 2005 2010

De

ns

ity

[G

b]


DRAM

Moores Law

MLC (3bitscell)

MLC (2bitscell)








M a infra m e


OutlineOutline












95908580001

01

1

10

100P

ow

er (

W)

x4 3years




Year

Vo

ltag

e [V

]

Po

wer

per

ch

ip [

W]

VD

D c

urr

ent

[A]


1998 2002 2006 2010 20140

05

1

15

2

25

0 0

200 500

Current

Power

Voltage




Your carstarter





where





+ P4


Psw = pt CL V2

dd fCLK












06 30 50


Ga

te d

elay

[n

s]

(no

rma

lize

d)

Po

we

r d

issi

pat

ion

[ W

](n

orm

ali

zed

)

1

1 02 5

1



Needs for Low-Power


The basic idea is
















AlgorithmLevel

ArchitectureLevel

CircuitLevel

Process DeviceLevel










[Figure: clock generation with a central PLL (fCLK) and four local blocks, PLL 1 / DLL 1 through PLL 4 / DLL 4; local clock labels f11, f12, f21, f31, f32, f41.]


[Figure: PLL based. Blocks: phase detector, current pump (up, down), loop filter, VCO, divider by N, regulated voltage, logic, digital system. Signals: CLKREF, CLKFB.]

[Figure: DLL based. Blocks: phase detector, current pump (up, down), loop filter, control, VCDL, digital system. Signals: CLKREF, CLKFB, in, out, TVCDL, DC1, DC2, ..., DCn; outputs f1, 2f1, ..., nf1.]


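[Figure: clock gating. Panel 1: DFF (D, C, Q) with enable and clock producing the gated-clock for the target flip-flops. Panel 2: latch with clock and enable producing the gated-clock.]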


[Figure: PLL (Clk generator) distributing Clk to blocks A, B, C, and D, with gating signals Enable_A, Enable_B, Enable_C.]
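The clock-gating diagrams above can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the circuit from the slides: it assumes the simple variant is a plain AND of clock and enable, and that the latch variant uses a level-sensitive latch that is transparent while the clock is low. The function names and the sample waveforms are hypothetical; each list entry is one half clock period.

```python
def and_gated_clock(clock, enable):
    """Plain AND gating: gated = clock AND enable.

    Any change of enable while clock is high shows up directly on the
    gated clock (a spurious or truncated pulse).
    """
    return [c & e for c, e in zip(clock, enable)]


def latch_gated_clock(clock, enable):
    """Latch-based gating (assumed: latch transparent while clock is low).

    The latch holds enable stable during the high phase, so the gated
    clock only ever carries whole pulses.
    """
    held = 0
    gated = []
    for c, e in zip(clock, enable):
        if c == 0:
            held = e          # latch follows enable during the low phase
        gated.append(c & held)
    return gated


# Hypothetical waveforms, one sample per half period.
clock  = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 1, 0, 0, 1, 0]

print(and_gated_clock(clock, enable))    # enable changes leak through
print(latch_gated_clock(clock, enable))  # only enable sampled at clock low counts
```

In the fourth sample, enable rises while the clock is already high: the AND version emits a pulse, while the latch version suppresses it. In the last sample, enable falls during the high phase: the AND version drops the pulse, while the latch version delivers it whole.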









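[Figure: power states RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW), with transition times of about 10 µs, about 90 µs, and 160 ms.]

[Figure: Power Manager consisting of an OBSERVER and a CONTROLLER; the OBSERVER collects workload information from the SYSTEM.]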
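The benefit of moving between the power states above can be estimated with simple arithmetic. The sketch below assumes the three power figures map to the states as RUN = 400 mW, IDLE = 50 mW, SLEEP = 0.16 mW, and it ignores the time and energy spent in transitions; the time shares are hypothetical and chosen only for illustration.

```python
# Assumed mapping of the power figures to the three states (in mW).
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_share):
    """Time-weighted average power in mW.

    time_share maps a state name to the (relative) time spent in it.
    Transition costs are deliberately ignored in this sketch.
    """
    total = sum(time_share.values())
    return sum(POWER_MW[state] * share for state, share in time_share.items()) / total


# Hypothetical workloads: always running versus a managed system that
# runs 10% of the time, idles 20%, and sleeps 70%.
always_on = average_power_mw({"RUN": 100})
managed = average_power_mw({"RUN": 10, "IDLE": 20, "SLEEP": 70})

print(f"always RUN: {always_on:.3f} mW")
print(f"managed:    {managed:.3f} mW")
```

A real power manager would also have to account for the wake-up latency from SLEEP, which is several orders of magnitude longer than the other transitions.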




[Chart: Clock; Memory; Control, I/O; Datapath. Values: 12, 16, 25, 43.]

[Chart: Arithmetic op, Logical op, Compare op, Data Move, Control Flow, Others.]







Outline



































version of 8086













workstations



































































[Figure: "My view" of higher complexity. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, compiler prefetching, multipath prefetching.]
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995. Curves: Supercomputers, Minicomputers, Mainframes, Microprocessors.]

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005. Data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]

Shared L2 caches


















Chip multiprocessor




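[Figure: FLASH and DRAM. Design rule [nm] (axis 0 to 120; labels 100nm, 90nm, 70nm, 55nm) and density [Gb] (log scale, 0.1 to 100) versus year, 2003 to 2010. Flash cell types: SLC, MLC (2 bits/cell), MLC (3 bits/cell). Density labels: 0512Mb (as extracted), 1Gb, 2Gb, 4Gb, 8Gb, 16Gb.]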



95908580001

01

1

10

100P

ow

er (

W)

x4 3years




Year

Vo

ltag

e [V

]

Po

wer

per

ch

ip [

W]

VD

D c

urr

ent

[A]


1998 2002 2006 2010 20140

05

1

15

2

25

0 0

200 500

Current

Power

Voltage




Your carstarter





where





+ P4


Psw = pt CL V2

dd fCLK












06 30 50


Ga

te d

elay

[n

s]

(no

rma

lize

d)

Po

we

r d

issi

pat

ion

[ W

](n

orm

ali

zed

)

1

1 02 5

1



Needs for Low-Power


The basic idea is
















AlgorithmLevel

ArchitectureLevel

CircuitLevel

Process DeviceLevel










f11 f12

PLL 1 DLL 1

PLL

fCLK

f21 f21

PLL 2 DLL 2

f31 f32

PLL 3 DLL 3

f41 f41

PLL 4 DLL 4

f11 f12

PLL 1 DLL 1

PLL

fCLK

f21 f21

PLL 2 DLL 2

f31 f32

PLL 3 DLL 3

f41 f41

PLL 4 DLL 4


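Figure: PLL-based and DLL-based clock generation.
PLL based: phase detector, current pump, loop filter, VCO, divider by N; signals CLKREF, CLKFB, up, down; further labels: regulated voltage, logic, digital system.
DLL based: phase detector, current pump, loop filter, VCDL (TVCDL), control, DC1 DC2 ... DCn; signals CLKREF, CLKFB, up, down, in, out; outputs f1, 2f1, ..., nf1; further label: digital system.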


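Figure: clock gating. One variant uses a DFF (D, C, Q) with enable and clock producing the gated clock for the target flip-flops; the other uses a latch with clock and enable producing the gated clock.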
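The latch-based gating in the figure above can be sketched as a small cycle-by-cycle simulation. This is a minimal model under stated assumptions, not the circuit from the slides: the latch is assumed transparent while the clock is low, the gated clock is assumed to be the AND of the clock and the latched enable, and the function names and waveforms are made up for illustration.

```python
def naive_gated_clock(clock, enable):
    """Gate the clock with a plain AND; enable changes pass straight through."""
    return [clk & en for clk, en in zip(clock, enable)]


def latch_gated_clock(clock, enable):
    """Gate the clock through a latch that is transparent while the clock is low.

    The latch holds the enable value during the high phase, so a change of
    enable in the middle of a clock pulse cannot chop or create a pulse.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en  # latch follows enable only while the clock is low
        gated.append(clk & latched)
    return gated


# Hypothetical waveforms, one sample per half clock period.
clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 1, 0, 0, 1, 0]

naive = naive_gated_clock(clock, enable)
latched = latch_gated_clock(clock, enable)
print("naive  :", naive)
print("latched:", latched)
```

In this sketch the enable rises during the second high phase and falls during the last one. The plain AND lets the first change through as a partial pulse and suppresses the last pulse, while the latch version ignores both changes until the clock is low again.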


Figure: blocks A, B, C, and D clocked by Clk from a PLL (Clk generator), with enable signals Enable_A, Enable_B, and Enable_C.









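Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW). Transition times as extracted: ~10s, ~90s, ~90s, ~10s, and 160 ms (a unit prefix appears to have been lost on the first four).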
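To give a feel for what the power states above are worth, the sketch below computes the average power for a given split of time across the states. It assumes the state powers read from the figure (400 mW, 50 mW, and 0.16 mW, the last with its decimal point restored), ignores the time and energy spent in transitions, and uses hypothetical time fractions.

```python
# State powers in mW as read from the figure (0.16 mW is an assumed reading).
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_fractions):
    """Average power for the given fraction of time spent in each state.

    Transition costs are ignored, so this is an optimistic lower bound.
    """
    total = sum(time_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * frac for state, frac in time_fractions.items())


# Hypothetical workload: busy 10% of the time.
always_on = average_power_mw({"RUN": 1.0})
idle_when_free = average_power_mw({"RUN": 0.1, "IDLE": 0.9})
sleep_when_free = average_power_mw({"RUN": 0.1, "SLEEP": 0.9})

print(f"always running : {always_on:.2f} mW")
print(f"idle when free : {idle_when_free:.2f} mW")
print(f"sleep when free: {sleep_when_free:.2f} mW")
```

A real power manager has to weigh these savings against the wake-up latency of each state, which this sketch leaves out.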

Figure: power manager made up of an observer and a controller, connected to the system; the observer receives workload information.




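Figure: two pie charts. First chart, labels: Clock, Memory, Control, I/O, Datapath; values as extracted: 12, 16, 25, 43. Second chart, labels: Compare op, Logical op, Others, Data Move, Control Flow, Arithmetic op; values as extracted: 1513, 51, 23, 43 (digits run together in the source).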







OutlineOutline



































version of 8086













workstations



































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline




















































AlgorithmLevel

ArchitectureLevel

CircuitLevel

Process DeviceLevel










f11 f12

PLL 1 DLL 1

PLL

fCLK

f21 f21

PLL 2 DLL 2

f31 f32

PLL 3 DLL 3

f41 f41

PLL 4 DLL 4

f11 f12

PLL 1 DLL 1

PLL

fCLK

f21 f21

PLL 2 DLL 2

f31 f32

PLL 3 DLL 3

f41 f41

PLL 4 DLL 4


phasedetector

curentpump

divider by N

up

down

CLKREF

CLKFB

VCO

loopfilter

digital system

regulated voltage


logic

phasedetector

curentpump

up

down

CLKREF

CLKFBloopfilter

digital system

f1 2f1 nf1

control

VCDL

TVCDL

DC1 DC2 DCn

in out

PLL based

DLL based


DFF

D

C

Q

enable

clock

gated-clock

target flip-flops

latch

clock

enable

gated-clock


D C

BA

Enable_BEnable_A

Enable_C

PLL(Clk generator)

Clk









RUN

SLEEPIDLE~90s

~10s 160ms


P=400mW

~90s~10s

P=50mW P=016mW

OBSERVER CONTROLLER

Workloadinformation

Power Manager

SYSTEM




12

16

25

43

Clock

Memory

Control IODatapath

1513

51

23

43

Compare op

Logical op

Others

Data Move

Control Flow

Arithmetic op







OutlineOutline



































version of 8086













workstations



































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline












































AlgorithmLevel

ArchitectureLevel

CircuitLevel

Process DeviceLevel










f11 f12

PLL 1 DLL 1

PLL

fCLK

f21 f21

PLL 2 DLL 2

f31 f32

PLL 3 DLL 3

f41 f41

PLL 4 DLL 4

f11 f12

PLL 1 DLL 1

PLL

fCLK

f21 f21

PLL 2 DLL 2

f31 f32

PLL 3 DLL 3

f41 f41

PLL 4 DLL 4


phasedetector

curentpump

divider by N

up

down

CLKREF

CLKFB

VCO

loopfilter

digital system

regulated voltage


logic

phasedetector

curentpump

up

down

CLKREF

CLKFBloopfilter

digital system

f1 2f1 nf1

control

VCDL

TVCDL

DC1 DC2 DCn

in out

PLL based

DLL based


DFF

D

C

Q

enable

clock

gated-clock

target flip-flops

latch

clock

enable

gated-clock


D C

BA

Enable_BEnable_A

Enable_C

PLL(Clk generator)

Clk









RUN

SLEEPIDLE~90s

~10s 160ms


P=400mW

~90s~10s

P=50mW P=016mW

OBSERVER CONTROLLER

Workloadinformation

Power Manager

SYSTEM




12

16

25

43

Clock

Memory

Control IODatapath

1513

51

23

43

Compare op

Logical op

Others

Data Move

Control Flow

Arithmetic op







OutlineOutline



































version of 8086













workstations



































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline













































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline







































































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline


































































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline





























































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline
























































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline

















































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline











































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



OutlineOutline







































































Higher complexity



Higher complexity




Simplepipelining

Scheduledpipelines

Multipleissue

Dynamicschedulin

g

Speculation

My view





ILPMountai

n



Compilerprefetchin

g

Multipathprefetchin

g

Simplecaches

CacheMountai

n

Per

form

ance

01

1

10

100

1965 1970 1975 1980 1985 1990 1995

Supercomputers

Minicomputers

Mainframes

Microprocessors


Tran

sist

ors

1000

10000

100000

1000000

10000000

100000000

1970 1975 1980 1985 1990 1995 2000 2005


i4004

i8008i8080

i8086

i80286

i80386

R2000

Pentium

R10000

R3000






















Shared L2 caches


















Chipmultiprocessor



