Codesigned Virtual Machines Part II
2006. 10. 18
Yu, Young Jin
DCSLAB
Contents
• Introduction
• Case Study (1)
  – Transmeta Crusoe
• Case Study (2)
  – IBM AS/400
Applying Codesigned VMs
• Advantages (performance, power efficiency, flexibility) can be achieved:
  – At the macro level: entirely new ISAs
    • VLIW: Transmeta Crusoe, IBM Daisy/BOA
    • Object-oriented source ISA: IBM AS/400
  – At the micro level: implementing specific performance enhancements
    • Instruction reordering, …
Case Study (1):
Transmeta Crusoe
Introduction
• In January 2000, Transmeta Corp. introduced the Crusoe processors.
  – Remarkably low power consumption
• Perhaps unexpectedly, the new technology is fundamentally software-based.
  – The power savings come from replacing large numbers of transistors with software.
The Crusoe Processor
• Consists of a hardware engine logically surrounded by a software layer.
  – H/W: the engine
    • A VLIW CPU capable of executing up to four operations in each clock cycle.
    • Its native instruction set bears no resemblance to the x86 instruction set.
  – S/W: Code Morphing Software (CMS)
    • Dynamically “morphs” x86 instructions into VLIW instructions.
The Crusoe Processor
• CMS technology changes the entire approach to designing microprocessors.
  – It demonstrates that practical microprocessors can be implemented as HW-SW hybrids.
  – It expands the design space.
  – Development teams may enlist software experts, working in parallel with hardware engineers, to bring products to market faster.
Technology Perspective
• Decouples the x86 ISA from the underlying processor hardware.
  – Each new CPU design requires only a new version of the Code Morphing software to translate x86 instructions into the new CPU’s native instruction set.
• Because the CMS typically resides in standard flash ROM on the motherboard, improved versions can even be downloaded into the processor in the field.
x86 vs. Crusoe
Crusoe Processor Fundamentals
• VLIW engine
  – Two integer units, a floating-point unit, a memory (load/store) unit, and a branch unit
  – Molecule: a long (64- or 128-bit) instruction word containing up to four RISC-like instructions, called atoms.
  – All atoms within a molecule are executed in parallel, and the molecule format directly determines how atoms are routed to functional units.
    • This greatly simplifies the decode and dispatch hardware.
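A minimal Python sketch of how a molecule’s format alone can decide which functional unit each atom goes to. The unit names, the three formats, and the `route` helper are hypothetical; the slides do not give Crusoe’s real encodings. The point is that routing becomes a positional table lookup with no run-time dependency checking.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical molecule formats: each format fixes which functional unit
# (two integer ALUs, FPU, memory unit, branch unit) every atom slot feeds.
FORMATS: Dict[int, Tuple[str, ...]] = {
    0: ("ALU0", "ALU1"),               # e.g. a 64-bit molecule with two atoms
    1: ("ALU0", "MEM", "FPU", "BR"),   # e.g. a 128-bit molecule with four atoms
    2: ("ALU0", "ALU1", "MEM", "BR"),
}


@dataclass(frozen=True)
class Atom:
    op: str                    # RISC-like operation, e.g. "add"
    operands: Tuple[str, ...]


@dataclass(frozen=True)
class Molecule:
    fmt: int
    atoms: Tuple[Atom, ...]

    def __post_init__(self) -> None:
        if len(self.atoms) > len(FORMATS[self.fmt]):
            raise ValueError("more atoms than this molecule format has slots")


def route(m: Molecule) -> List[Tuple[str, Atom]]:
    """Pair each atom with a functional unit purely by its slot position.

    No dependency checks or unit arbitration happen here: the translator has
    already guaranteed that the atoms in one molecule are independent.
    """
    return list(zip(FORMATS[m.fmt], m.atoms))


m = Molecule(1, (Atom("add", ("%r1", "%r2", "%r3")), Atom("ld", ("%r4", "0(%r5)"))))
for unit, atom in route(m):
    print(unit, atom.op)
```

The hardware analogue of `route` is fixed wiring from slot to unit, which is why the decode and dispatch logic stays small.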
Crusoe Processor Fundamentals
• The integer register file
  – Has 64 registers, %r0 through %r63.
  – CMS allocates some registers to hold x86 state, while others hold state internal to the system or serve as temporary registers.
Crusoe Processor Fundamentals
• To keep the processor running at full speed, molecules are packed as fully as possible with atoms.
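One common way a translator can pack atoms into molecules is greedy list scheduling. The sketch below is an illustration under stated assumptions, not CMS’s actual scheduler: the per-molecule unit capacities, the one-molecule latency, and the `pack` function are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical per-molecule capacities: two integer units, one FPU,
# one memory unit, one branch unit, and at most four atoms in total.
CAPACITY = {"ALU": 2, "FPU": 1, "MEM": 1, "BR": 1}
MAX_ATOMS = 4

AtomSpec = Tuple[str, str, Set[str]]  # (name, unit class, names of producer atoms)


def pack(atoms: List[AtomSpec]) -> List[List[str]]:
    """Greedily fill each molecule with ready atoms, scanning in program order.

    Assumes unit latency: an atom may issue in any molecule after the ones
    holding all of its producers. Producers must appear in `atoms`.
    """
    placed: Dict[str, int] = {}        # atom name -> molecule index
    molecules: List[List[str]] = []
    pending = list(atoms)
    while pending:
        idx = len(molecules)
        current: List[str] = []
        used = {unit: 0 for unit in CAPACITY}
        remaining: List[AtomSpec] = []
        for name, unit, deps in pending:
            ready = all(d in placed and placed[d] < idx for d in deps)
            fits = len(current) < MAX_ATOMS and used[unit] < CAPACITY[unit]
            if ready and fits:
                current.append(name)
                used[unit] += 1
                placed[name] = idx
            else:
                remaining.append((name, unit, deps))
        if not current:
            raise ValueError("an atom depends on something that is never scheduled")
        molecules.append(current)
        pending = remaining
    return molecules


# Two loads feeding an add, an independent increment, and a dependent branch.
example = [
    ("ld1", "MEM", set()),
    ("ld2", "MEM", set()),
    ("add", "ALU", {"ld1", "ld2"}),
    ("inc", "ALU", set()),
    ("br", "BR", {"add"}),
]
print(pack(example))
```

Here `inc` is pulled forward into the first molecule next to `ld1`, while `ld2` must wait because this hypothetical engine has only one memory unit.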
Conventional superscalar…
• This type of processor hardware is much more complex than the Crusoe processor’s simple VLIW engine.
Code Morphing Software
• CMS
  – Is fundamentally a dynamic translation system.
  – In this case, it translates the x86 ISA into the VLIW ISA.
  – The x86 ISA is the only thing x86 code sees.
    • The only program written directly for the VLIW engine is the Code Morphing Software itself.
Hierarchy
Crusoe’s VLIW instr. Scheduling
Code Morphing Software
CMS Memory Layout
CMS: Drawing the HW-SW Line
• Choosing which functions to implement in HW and which in SW is a major engineering challenge.
  – It involves issues such as cost and complexity, overall performance, and power consumption.
  – For example, the HW-SW line might be drawn differently for a high-end server processor.
CMS: Decoding and Scheduling
• Code Morphing can translate an entire group of x86 instructions at once,
  – whereas a superscalar x86 translates single instructions in isolation.
• The Code Morphing approach can amortize the cost of translation over many executions,
  – allowing it to use much more sophisticated translation and scheduling algorithms.
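The amortization argument can be made concrete with a break-even count. Assume (hypothetically) that interpreting a block costs `interp_cost` per execution, translating it costs `translate_cost` once, and each translated execution costs `native_cost`. Translation pays off once `translate_cost + n * native_cost < n * interp_cost`. The cost figures below are made up for illustration.

```python
import math


def break_even_runs(translate_cost: float, interp_cost: float, native_cost: float) -> int:
    """Smallest execution count n for which translating beats interpreting.

    Solves translate_cost + n * native_cost < n * interp_cost for integer n.
    """
    saving = interp_cost - native_cost
    if saving <= 0:
        raise ValueError("translated code must be cheaper per execution")
    return math.floor(translate_cost / saving) + 1


# Illustrative numbers only: a cheap and an expensive (more optimizing) translator.
print(break_even_runs(5000, 100, 10))
print(break_even_runs(20000, 100, 5))
```

A more expensive optimizer raises the break-even count, so it is only worthwhile for code that runs many times; this is why spreading the cost over many executions makes sophisticated scheduling affordable.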
CMS: Caching
• The translation cache resides in a separate memory space that is inaccessible to x86 code.
• As an application executes,
  – Code Morphing “learns” more about the program and improves its translations, so the program executes faster and faster.
• Consequently, some benchmarks do not accurately predict the performance of the Crusoe processor!
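A minimal sketch of a translation cache keyed by x86 block address. The `TranslationCache` class and its `translate` callback are hypothetical; the cache is an ordinary dictionary kept outside anything that models x86-visible memory, standing in for the separate memory space described above.

```python
from typing import Callable, Dict, List

Translation = Callable[[], None]


class TranslationCache:
    """Maps x86 block start addresses to translated host code."""

    def __init__(self, translate: Callable[[int], Translation]) -> None:
        self._translate = translate
        self._cache: Dict[int, Translation] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, pc: int) -> Translation:
        code = self._cache.get(pc)
        if code is None:
            self.misses += 1              # first visit: pay for translation once
            code = self._translate(pc)
            self._cache[pc] = code
        else:
            self.hits += 1                # later visits reuse the translation
        return code


trace: List[int] = []
tc = TranslationCache(lambda pc: (lambda: trace.append(pc)))
for pc in [0x1000, 0x1000, 0x2000, 0x1000]:
    tc.lookup(pc)()
print(tc.hits, tc.misses)
```

The hit/miss counts also hint at the benchmark caveat: a short run is dominated by misses (translation time), while a long run is dominated by cheap hits.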
CMS: Filtering
• The translation system needs to
  – choose carefully how much effort to spend translating and optimizing a given piece of x86 code.
• A wide choice of execution modes
  – Interpretation only (no translation)
  – Simple-minded code generation
  – Highly optimized code generation
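One plausible filtering policy is to promote code through the three execution modes as its execution count grows. The thresholds and the `choose_mode` function are assumptions for illustration; the slide does not say what criteria CMS actually uses.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    INTERPRET = "interpretation only"
    SIMPLE = "simple-minded code generation"
    OPTIMIZED = "highly optimized code generation"


# Hypothetical promotion thresholds (execution counts).
SIMPLE_AFTER = 50
OPTIMIZE_AFTER = 5000


def choose_mode(exec_count: int) -> Mode:
    """Spend more translation effort only on code that has proven to be hot."""
    if exec_count >= OPTIMIZE_AFTER:
        return Mode.OPTIMIZED
    if exec_count >= SIMPLE_AFTER:
        return Mode.SIMPLE
    return Mode.INTERPRET


counts: Counter = Counter()
for _ in range(60):
    counts[0x1000] += 1        # a loop body executed 60 times
counts[0x2000] += 1            # code executed once
print(choose_mode(counts[0x1000]), choose_mode(counts[0x2000]))
```

Code that runs once never pays for translation, while the rare very hot regions get the expensive optimizer.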
CMS: Prediction and Path Selection
• CMS can gather feedback.
  – Instrumentation profiling
    • The translator adds code to collect information.
  – This data can be used later to decide when and what to optimize and translate.
    • For example, if a given branch is highly biased, …
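A sketch of the kind of counters instrumented code could update for each branch, plus a test for “highly biased”. The counter layout, the 95% threshold, and the minimum sample count are assumptions; what CMS then does with a biased branch is elided on the slide, so the sketch stops at detection.

```python
from collections import defaultdict
from typing import Dict, Optional


class BranchProfile:
    """Per-branch taken/total counters bumped by instrumentation code."""

    def __init__(self) -> None:
        self.taken: Dict[int, int] = defaultdict(int)
        self.total: Dict[int, int] = defaultdict(int)

    def record(self, branch_pc: int, taken: bool) -> None:
        self.total[branch_pc] += 1
        if taken:
            self.taken[branch_pc] += 1

    def bias(self, branch_pc: int, threshold: float = 0.95,
             min_samples: int = 100) -> Optional[bool]:
        """True if highly biased taken, False if highly biased not-taken, else None."""
        n = self.total[branch_pc]
        if n < min_samples:
            return None                      # not enough data yet
        frac = self.taken[branch_pc] / n
        if frac >= threshold:
            return True
        if frac <= 1 - threshold:
            return False
        return None


p = BranchProfile()
for i in range(200):
    p.record(0x4010, taken=(i % 100 != 0))   # taken 198 of 200 times
print(p.bias(0x4010))
```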
CMS: Making a Translation
• Stages: front end, well-known optimizations, scheduling
• The molecules explicitly encode the instruction-level parallelism; hence they can be executed by a simple VLIW engine.
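HW Support for Code Morphing
• Exceptions
  • The “precise exception” problem: because translated code is reordered, a trap can occur “too soon”, when the x86 state does not yet correspond to the faulting x86 instruction.
* Solution: use shadow registers!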
HW Support for Code Morphing
• All registers holding x86 state are shadowed (working copy / shadow copy).
  – Normal atoms update only the working copy of the register.
  – “Commit” operation: working -> shadow registers.
  – “Rollback” operation: shadow -> working registers.
• Undoing changes to memory
  – Store data is held in a “gated store buffer”.
  – Commit / rollback
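A sketch of the commit/rollback mechanism: atoms update only working registers and queue stores in a gated buffer; commit makes both architecturally visible, and rollback discards everything since the last commit. The `CommitState` class, dict-based registers and memory, and the store-to-load forwarding in `load` are illustrative assumptions, not Transmeta’s hardware design.

```python
from typing import Dict, List, Tuple


class CommitState:
    """Working/shadow register copies plus a gated store buffer."""

    def __init__(self, regs: Dict[str, int], memory: Dict[int, int]) -> None:
        self.working = dict(regs)
        self.shadow = dict(regs)                  # last committed x86 state
        self.memory = memory                      # committed memory only
        self.store_buffer: List[Tuple[int, int]] = []

    def write_reg(self, name: str, value: int) -> None:
        self.working[name] = value

    def store(self, addr: int, value: int) -> None:
        self.store_buffer.append((addr, value))   # gated: not yet in memory

    def load(self, addr: int) -> int:
        # Assumed forwarding: a translation must see its own buffered stores.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def commit(self) -> None:
        self.shadow = dict(self.working)
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()

    def rollback(self) -> None:
        self.working = dict(self.shadow)
        self.store_buffer.clear()


s = CommitState({"eax": 1}, {})
s.write_reg("eax", 2)
s.store(0x10, 7)
s.rollback()               # e.g. a trap came "too soon": undo the whole region
print(s.working, s.memory)
s.write_reg("eax", 3)
s.store(0x10, 9)
s.commit()
print(s.shadow, s.memory)
```

On an exception, rolling back restores the x86 state of the last commit point, from which the region can be re-executed more conservatively so the fault is reported precisely.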
HW Support for Code Morphing
• Alias Hardware
  – When the translator moves a load operation ahead of a store operation,
  – it converts the load into a load-and-protect and the store into a store-under-alias-mask.
  – This makes it always safe to reorder memory loads/stores: if the store turns out to alias the protected load, the hardware raises an exception.
HW Support for Code Morphing
• Alias Hardware
<Original Code>
    St   0(r1), r2
    …
    Ld   r3, 0(r4)
    …
    St   0(r5), r6
    …
    Ld   r7, 0(r8)
    Add  r9, r3, r7
<Rescheduled Code> - Unsafe
    Ld   r3, 0(r4)
    Ld   r7, 0(r8)
    St   0(r1), r2
    …
    …
    St   0(r5), r6
    …
    Add  r9, r3, r7
<Rescheduled Code> - Protected
    Ldp  r3, 0(r4)    x
    Ldp  r7, 0(r8)    x x
    Stam 0(r1), r2
    …
    …
    Stam 0(r5), r6
    …
    Add  r9, r3, r7
* The ldp/stam pair is an excellent example that illustrates the interplay between the codesigned hardware and software in a codesigned VM.
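A sketch of the ldp/stam interplay. `load_protect` records a hoisted load’s address range in an alias register; `store_under_mask` checks a store against the alias registers named in its mask and faults on overlap. The class, the register count, the 4-byte access size, and `AliasFault` are hypothetical. In the example, `Ldp r3` was hoisted above one store and `Ldp r7` above both, so the first `Stam` checks both alias registers and the second checks only `Ldp r7`’s.

```python
from typing import Dict, Iterable, Tuple


class AliasFault(Exception):
    """Raised when a store hits an address that a hoisted load already read."""


class AliasRegisters:
    def __init__(self, count: int = 8) -> None:
        self.count = count
        self.regs: Dict[int, Tuple[int, int]] = {}   # alias reg -> (addr, size)

    def load_protect(self, areg: int, addr: int, size: int = 4) -> None:
        """ldp: remember the loaded address range in alias register `areg`."""
        if not 0 <= areg < self.count:
            raise ValueError("no such alias register")
        self.regs[areg] = (addr, size)

    def store_under_mask(self, mask: Iterable[int], addr: int, size: int = 4) -> None:
        """stam: check the store against every alias register named in `mask`."""
        for areg in mask:
            if areg in self.regs:
                p_addr, p_size = self.regs[areg]
                if addr < p_addr + p_size and p_addr < addr + size:
                    raise AliasFault(f"store to {addr:#x} overlaps load from {p_addr:#x}")

    def clear(self) -> None:
        """At a commit point the protections are no longer needed."""
        self.regs.clear()


r1, r4, r5, r8 = 0x300, 0x100, 0x200, 0x200   # r5 happens to equal r8
ar = AliasRegisters()
ar.load_protect(0, r4)            # Ldp r3, 0(r4)
ar.load_protect(1, r8)            # Ldp r7, 0(r8)
ar.store_under_mask({0, 1}, r1)   # Stam 0(r1), r2: no overlap
try:
    ar.store_under_mask({1}, r5)  # Stam 0(r5), r6: aliases the hoisted load
except AliasFault as e:
    print(e)
```

In a full system the fault would, for example, trigger a rollback to the last commit so the region can be re-executed without the unsafe reordering.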
HW Support for Code Morphing
• Coping with Self-Modifying Code
  – x86 instructions in memory get overwritten, either
    • because the OS is loading a new program, or
    • because an application is using self-modifying code.
  – When this happens to code that has already been translated,
    • the CMS needs to be notified to keep it from erroneously executing a translation of the old code.
HW Support for Code Morphing
• Coping with Self-Modifying Code
  – Whenever the system translates a block of x86 code, it write-protects the page.
    • It does so by setting a dedicated “translated” bit in that page’s entry in the processor’s memory management unit.
    • That bit is invisible to x86 software.
  – When a protected page is written to, the simplest remedy is to invalidate the affected translations.
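A sketch of per-page “translated” bits with the simplest remedy, invalidation on write. The `SMCGuard` class and its data structures are hypothetical; the bit set lives outside anything the guest can read, standing in for the MMU bit that is invisible to x86 software. The 4 KiB page size is the standard x86 page size.

```python
from collections import defaultdict
from typing import Dict, Set

PAGE_SIZE = 4096  # standard x86 page size


class SMCGuard:
    def __init__(self) -> None:
        self.translated: Set[int] = set()                     # pages with the bit set
        self.by_page: Dict[int, Set[int]] = defaultdict(set)  # page -> block starts
        self.cache: Dict[int, str] = {}                       # block start -> translation

    def add_translation(self, start: int, end: int, code: str) -> None:
        """Record a translation of x86 bytes [start, end) and protect its pages."""
        self.cache[start] = code
        for page in range(start // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1):
            self.translated.add(page)
            self.by_page[page].add(start)

    def guest_write(self, addr: int) -> None:
        """Called for x86 stores; only writes to protected pages take the slow path."""
        page = addr // PAGE_SIZE
        if page not in self.translated:
            return
        # Simplest remedy: drop every translation that touches this page.
        for start in self.by_page.pop(page):
            self.cache.pop(start, None)
        self.translated.discard(page)


g = SMCGuard()
g.add_translation(0x1000, 0x1040, "T1")
g.add_translation(0x3000, 0x3010, "T2")
g.guest_write(0x2000)      # unprotected page: nothing happens
g.guest_write(0x1020)      # overwrites translated code: T1 is invalidated
print(sorted(g.cache))
```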
Example: A complex translation
Case Study (2):
IBM AS/400
From IBM’s homepage…
• The accelerating rate of change of both hardware and software technologies necessitates that the system you select has been designed with the future in mind.
  – “We believe that the IBM AS/400 will be the number one choice!”
Introduction
• The design of the AS/400 insulates application programs from changing hardware characteristics through a layer of microcode.
  – The interface: TIMI (Technology Independent Machine Interface)
  – The microcode layer: LIC (Licensed Internal Code)
• In 1995, the AS/400 changed its processor technology (CISC -> 64-bit RISC).
  – No recompiling or rewriting was required.
  – Not only did existing programs run, but they ran as fully 64-bit programs.
AS/400 architecture
• The TIMI layer separates the hardware and LIC from the OS.
• TIMI instructions are translated to a specific hardware instruction set as part of the back end of the compilation process.
AS/400 architecture
• TIMI is a virtual instruction set.
  – All user-mode programs are stored as TIMI instructions.
  – Conceptually somewhat similar to the VM architecture of programming environments such as Smalltalk, Java, and .NET
  – Stored within the final program object
  – Object-based ISA
Memory Architecture
• The TIMI has a memory architecture composed of objects.
  – The objects are completely isolated from one another and can only be accessed via pointers.
  – Actual address values contained in pointers are not made visible to software above the TIMI.
  – The implementation of the object-based memory is done entirely below the TIMI.
Memory Architecture
• Protecting the integrity of pointers is an essential part of any object-based system.
  – Object pointers are encoded in 128 bits.
    • Upper 64 bits: type info, authorization, …
    • Lower 64 bits: 64-bit PowerPC virtual address
  – Significant extension to the PowerPC memory architecture
    • Adds protection for object pointers
    • Load/store-pointer instructions
    • A 65th bit indicates whether the location contains a pointer
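A sketch of why a tag bit protects pointer integrity: only the store-pointer operation sets the tag, any ordinary store clears it, and load-pointer refuses untagged slots, so software cannot forge a pointer by writing raw bits. The `TaggedMemory` class, one tag per 128-bit slot (the slide calls it the 65th bit, so the real granularity may differ), and the helper names are assumptions, not IBM’s encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


class PointerFault(Exception):
    """Raised when a slot is used as a pointer but its tag bit is clear."""


def make_pointer(info: int, vaddr: int) -> int:
    """Upper 64 bits: type/authorization info; lower 64 bits: virtual address."""
    return ((info & MASK64) << 64) | (vaddr & MASK64)


def pointer_vaddr(ptr: int) -> int:
    return ptr & MASK64


@dataclass
class TaggedMemory:
    slots: Dict[int, Tuple[int, bool]] = field(default_factory=dict)

    def store_data(self, addr: int, value: int) -> None:
        self.slots[addr] = (value & MASK128, False)  # ordinary store clears the tag

    def store_pointer(self, addr: int, ptr: int) -> None:
        # Sketch trusts its caller; a fuller model would require ptr to be tagged.
        self.slots[addr] = (ptr & MASK128, True)

    def load_pointer(self, addr: int) -> int:
        value, tag = self.slots.get(addr, (0, False))
        if not tag:
            raise PointerFault(f"slot {addr:#x} does not hold a valid pointer")
        return value


mem = TaggedMemory()
ptr = make_pointer(0xAB, 0x7000)
mem.store_pointer(0x10, ptr)
print(hex(pointer_vaddr(mem.load_pointer(0x10))))
mem.store_data(0x10, ptr)          # same bits, written as plain data
try:
    mem.load_pointer(0x10)
except PointerFault as e:
    print(e)
```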
Instruction Set
• TIMI instruction format
    Field:   opcode  | opcode extend      | operand1 … operandN      | dest1 … dest4
    Size:    2 bytes | 2 bytes (optional) | 3 bytes each (optional)  | 3 bytes each (optional)
    Example: Addn & branch | Eq 0 Gt 0 0 0 | sum, addend1, addend2 | dest1, dest2
• Multiway conditional branch
  – This is the “architected representation”.
  – It is translated to an implementation-dependent form, and it does the work of multiple RISC instructions.
Instruction Set
[Figure: the TIMI instructions “addn 34 32 31” and “muln 36 34 37” name their operands by index (31–37); the operand types (e.g., const, Binary(2), Binary(4)) are described through the ODT direction vector and the ODT entry string (4 A 2 3 … 1 3 D F …).]
• Add numeric (addn) and multiply numeric (muln) are generic.
• Entries in the ODT indicate the types of the operands and the data flow.
• The actual storage locations are determined only after the TIMI is translated.
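A sketch of what “generic” means here: `addn` and `muln` carry no type information themselves; each operand is an ODT index, and the ODT entry decides whether it is a constant or a signed Binary(n) variable (and hence how results wrap). The `GenericMachine` class is hypothetical, the ODT contents in the example are made up, and operands are read with the result first, matching the “sum, addend1, addend2” order in the instruction format. A flat dictionary stands in for storage that, on the real system, is assigned only when the TIMI is translated.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Binary:
    width: int        # signed binary scalar of `width` bytes, e.g. Binary(2)


@dataclass(frozen=True)
class Const:
    value: int


OdtEntry = Union[Binary, Const]


def wrap(value: int, width: int) -> int:
    """Truncate to a signed two's-complement integer of `width` bytes."""
    bits = 8 * width
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


class GenericMachine:
    def __init__(self, odt: Dict[int, OdtEntry]) -> None:
        self.odt = odt
        self.storage: Dict[int, int] = {}

    def read(self, idx: int) -> int:
        entry = self.odt[idx]
        return entry.value if isinstance(entry, Const) else self.storage.get(idx, 0)

    def write(self, idx: int, value: int) -> None:
        entry = self.odt[idx]
        if not isinstance(entry, Binary):
            raise TypeError(f"ODT entry {idx} is not a writable variable")
        self.storage[idx] = wrap(value, entry.width)   # width comes from the ODT

    def addn(self, dest: int, a: int, b: int) -> None:
        self.write(dest, self.read(a) + self.read(b))

    def muln(self, dest: int, a: int, b: int) -> None:
        self.write(dest, self.read(a) * self.read(b))


# Made-up ODT: 31 and 37 are constants, the rest are Binary variables.
vm = GenericMachine({31: Const(5), 32: Binary(2), 34: Binary(4),
                     36: Binary(4), 37: Const(3)})
vm.storage[32] = 10
vm.addn(34, 32, 31)    # addn 34 32 31
vm.muln(36, 34, 37)    # muln 36 34 37
print(vm.storage[34], vm.storage[36])
```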
Input/Output
• The presence of IOPs (I/O processors) simplifies the task of pushing the device-dependent aspects out of the central processor.
Input/Output
• At the level of the TIMI,
  – there is no secondary (disk) storage; rather, it is part of the unified memory architecture.
    • All disk-management software, drivers, etc. exist in the implementation-dependent part of the system.
• The OS interacts with software below the TIMI level (and with I/O devices)
  – through instructions that operate on TIMI-level objects.
Input/Output
• TIMI-Supported Objects
  – Access group, Context, …
  – Authorization List, User Profile, …
  – Dictionary, Index, …
  – Queue, Mode descriptor, …
  – Logical unit descriptor, …
  – Module, Program, …
Code Translation & Concealment
• HLL -> Template (TIMI + ODT) -> Program Object
• The contents of the program object cannot be directly observed above the TIMI level.
• Materialization
  – Gives the program back to the user in its original, machine-independent form.
  – The platform switch is transparent to the user.
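A sketch of why keeping the template inside the program object makes a platform switch transparent: materialization hands back only the machine-independent template, and migrating re-runs the translator on that template instead of recompiling the source. `ProgramObject`, `create_program`, `migrate`, and the toy translator are hypothetical; Python’s leading underscore only signals concealment and cannot enforce it the way the TIMI boundary does.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Template = Tuple[str, ...]                   # stand-in for TIMI instructions + ODT
Translator = Callable[[Template, str], str]  # (template, platform) -> native code


@dataclass
class ProgramObject:
    _template: Template   # machine-independent form, kept for materialization
    _native: str          # implementation-dependent executable code
    _platform: str

    def materialize(self) -> Template:
        """Return the original machine-independent form, never the native code."""
        return self._template


def create_program(template: Template, platform: str, translate: Translator) -> ProgramObject:
    """Translate below the TIMI and seal template and native code together."""
    return ProgramObject(template, translate(template, platform), platform)


def migrate(prog: ProgramObject, new_platform: str, translate: Translator) -> ProgramObject:
    """Platform switch: rebuild native code from the retained template."""
    return create_program(prog.materialize(), new_platform, translate)


def toy_translate(template: Template, platform: str) -> str:
    return f"{platform}:" + ";".join(template)


prog = create_program(("addn 34 32 31", "muln 36 34 37"), "CISC", toy_translate)
moved = migrate(prog, "PowerPC", toy_translate)
print(moved.materialize() == prog.materialize())
```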
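Code Translation & Concealment
[Figure: an HLL program (space object) is compiled into a template (TIMI, ODT) stored in a space object; the “create program” instruction (source -> result) invokes the translator below the TIMI level, producing a program object that holds the template (TIMI, ODT) together with implementation-dependent executable code.]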