© 2013 IBM Corporation
The Evolution of the Hardware/Software Interface
Marcel Mitran - Lead Architect, IBM Java on Z
November 19th, 2013
[Figure: processor chip floorplan showing Core0 to Core5, L3 cache control and L3 data, MCU, GX and co-processor (cop) units]
Who am I, and What Do I Do
School
• McGill B.Eng ECE, class of '97
• McGill M.Eng CIM, class of '01 (computer vision)
Joined IBM Sept 10th, 2001
• IBM Master Inventor and Senior Technical Staff Member (STSM)
• Lead Architect, IBM Java on System z
– Corporate-wide responsibility for developing IBM's JDK on System z
• World-wide technical leader for compiler development for System z
– Hardware/Software interface
Work closely with:
• JVM (Ottawa Lab), C/C++, COBOL (Silicon Valley Lab), XML compiler teams
• Hardware designers (Power, System z, Intel, AMD…)
• O/S developers (AIX, z/OS, Linux…)
• Middleware developers (WebSphere, DB2, Tivoli)
• Research (Tokyo Research Lab, Watson Research Lab)
Did you say Mainframe!?!?
System z runs applications that run my life
- Used by 95% of the Fortune 500 companies
- 80% of corporate data resides or originates on mainframes
- Runs everything from your class registration to airplane reservations
- 2/3 of business transactions for US retail banks run directly on mainframes
System z is expandable, adaptable and flexible
- Add capacity and software updates without a reboot
- Respond automatically to spikes in workload demands
- Align processing priorities with business priorities
System z is dependable and available
- Less than 5 minutes of downtime per year
- Mean time to failure is 40 years
System z is secure
- Highest level of industry security rating, EAL5, awarded to the mainframe
System z is virtualized
- Create a new Linux image in as little as 90 seconds
"The return of the mainframe: Back in fashion", Jan 14th, 2010
Toni Sacconaghi of Bernstein Research estimates that 40% of IBM’s profits are mainframe-related.
Classical View of the Hardware/Software Interface
• Classical view of the h/w + s/w interface (at least with my 1st Edition copy of the text)
– Data representation
– Instruction architecture
– Pipelining
– Memory hierarchy
• Little to no reference to
– Multi-core/SMT
– Dynamic compilation/managed runtimes
– I/O attached accelerators
– Domain-specific devices/languages
– Distributed computing
– Parallel programming models
– etc
• Up until ~5 years ago, largely OK
zEnterprise EC12 Hardware – Available since Sept 2012
Continued aggressive investment in H/W + S/W co-design
Hardware Transactional Memory (HTM)
– Better concurrency for multi-threaded/parallel applications
– Fine-grained concurrency
Run-time Instrumentation (RI)
– Real-time feedback on dynamic program characteristics
– Enables increased optimization by Java
2GB page frames
– Improved performance targeting 64-bit heaps
Page-able 1MB large pages using Flash Express
– Better versatility of managing memory
Shared-Memory-Communication
– RDMA over Converged Ethernet
zEnterprise Data Compression accelerator
– FPGA-based acceleration of gzip
Misc new instructions
– Software hints/directives
– Traps
– Decimal conversion
System Specs
– 120-way machine
– 3TB of real storage
– IBM zAware: autonomic monitoring
– Common Criteria Evaluation Assurance Level 5+ (un-matched)
– IBM DB2 Analytics Accelerator
Inflection Point in Computing?
Topics discussed in a typical week at work
• The end of single-thread performance
• The evolution of HPC into analytics and optimization in the enterprise
• "BigData"
• The "Cloud"
• Systems of engagement, scripting…
Evolution of the Enterprise Computing Ecosystem
[Figure: enterprise computing ecosystem showing WebSphere, CICS/IMS, DB2 and Scoring]
OLTP
Industry has spent the last decade focusing on OnLine Transaction Processing (OLTP)
• Enabling/optimizing data persistency and serving
• Internet of Things
• Trillions of transactions/day
• Massive amounts of data (structured/unstructured)
Evolution of the Enterprise Computing Ecosystem
[Figure: enterprise computing ecosystem showing Cognos, InfoSphere Infoserver, InfoSphere Warehouse, SPSS, CPLEX, ILOG Rules, Scoring, WebSphere, CICS/IMS and DB2]
BAO
Business Intelligence, Analytics and Optimization
• A clear need to understand how to interpret/optimize/predict the data
  eg. fraud detection, customer relationship management, low-latency trading etc
Evolution of the Enterprise Computing Ecosystem
[Figure: enterprise computing ecosystem showing Cognos, InfoSphere Infoserver, InfoSphere Warehouse, SPSS, CPLEX, ILOG Rules, Scoring, WebSphere, CICS/IMS and DB2]
Cloud
Economies of scale for computational infrastructure through 3rd party hosting
• Driving down cost
• Disruptive changes to IT industry
Significant presence for traditional enterprise languages
Base: North American and European enterprise software developers; Source: Forrsights Developer Survey, Q1 2013
“How do you allocate the time you spend writing code across the following programming languages?” (Enter a percentage for each)
Java leads developer language choice
=> ~37%
Scripting languages gaining significant momentum
=> ~37%
=> ~26%
Evolution of the Enterprise Computing Ecosystem
[Figure: enterprise computing ecosystem showing Cognos, InfoSphere Infoserver, InfoSphere Warehouse, SPSS, CPLEX, ILOG Rules, Scoring, WebSphere, CICS/IMS and DB2, now joined by the scripting ecosystem]
Scripting
Cloud and DevOps represent a new and fast evolving tool-chain
• Client-side scripting languages migrating to the servers
• Somewhat foreign to the traditional enterprise space
• Moving very quickly
(Perceived) Single Thread Performance/Latency Matters
• SLA -> Service Level Agreements
– Vendor guarantees response-time of transactions
– Increasingly intelligent transactions
• Batch windows
– Fixed elapsed window to complete
  eg. balance books overnight before opening for business next day
• Plethora of S/W idioms that do not fall easily into divide-and-conquer
– Finite-state-machine
– Real-time analytics
– Queuing/dispatching
– Enterprise middleware
• Practical challenges of coarse-grain parallelism
– Even very coarse parallelism can be non-trivial to implement
• Fine-grained parallelism… hold your breath?
Performance Innovation is no Longer just a Processor Game
A range of aggressive hardware and systems designs
– Fit-for-purpose/Hybrid systems
– Appliances (eg. Netezza)
– GPUs/FPGAs
– Co-processors (eg. crypto)
– Transactional memory
Winners/losers will be defined by a few key criteria
– Time-to-market and prevalence of core infrastructure
– Stack opportunity is typically unclear + long lead-time/high cost => risk
– Ease of adoption/integration
1. Runtime and middleware (easiest)
2. Compiler optimization
3. New programming models/libraries
4. Hand written assembler (hardest)
IBM® DB2® Analytics Accelerator
IBM® DB2® Analytics Accelerator for z/OS is a high-performance appliance that integrates IBM Netezza and zEnterprise technologies. The solution delivers extremely fast results for complex and data-intensive DB2 queries on data warehousing, business intelligence and analytic workloads.
Shared Memory Communications (SMC-R)
Exploit RDMA over Converged Ethernet (RoCE) with qualities of service support for dynamic failover to redundant hardware
• Transparent exploitation for TCP sockets based applications
• Compatible with existing TCP/IP based load balancing solutions
• Up to 40% reduction in end-to-end transaction latency
• Slight increase in CPU is due to very small message size in this workload (~100 bytes). Workloads with larger payloads are expected to show a CPU savings
[Figure: SMC-R vs TCP/IP (OSA), WAS Liberty <-> DB2 workload, 40 client connections. Percent change: latency -40.00%, CPU consumption per transaction +3.11%]
[Figure: a System z host and an SMC-R enabled platform, each with Middleware/Application, Sockets, TCP, IP, Interface and SMC-R layers, connected through OSA/ETH over the IP network (Ethernet) and through RNICs over the RDMA network, RoCE (CEE). TCP connection establishment runs over IP with dynamic negotiation for SMC-R; data flows using RDMA over RoCE]
(Controlled measurement environment, results may vary)
zEnterprise Data Compression (zEDC)
(Controlled measurement environment, results may vary)
• Exploited transparently through standard Java APIs (eg. java.util.zip)
• Up to 10x improvement in CPU time compressing data compared to L1 zlib
• Compression ratio of ~4x
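As a rough illustration of what "exploited transparently through standard Java APIs" means for application code, the sketch below compresses and decompresses a buffer using only java.util.zip. The class name GzipRoundTrip and the sample payload are made up for this example. Whether a particular JVM and operating system route such calls to the accelerator depends on configuration that the slide does not describe, so nothing here is zEDC-specific.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; only standard java.util.zip calls are used.
final class GzipRoundTrip {

    // Compress a byte array with the standard gzip stream.
    static byte[] compress(byte[] input) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Decompress a gzip byte array back to the original bytes.
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                sink.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Illustrative, highly repetitive payload.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("account=").append(i % 10).append(";status=OK\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
    }
}
```

The application code stays the same with or without an accelerator, which is the adoption path the slide ranks as easiest (runtime and middleware).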
GPU Acceleration on Standard Java Arrays
2x faster
48x faster !
Chart in Logarithmic scale
A New Age for Compilers and Managed Runtimes
Compilers and managed runtimes will need to evolve
Today:
– Optimize once for single static architecture
– Synergy with micro-architecture is open-loop
– Parallelism is in its infancy
– Some dynamic in Java (Just-in-time)
Tomorrow:
– Deeper synergy with micro-architecture
– Providing separation of interests to hedge risk from large changes to h/w
– Automatic parallelism (thread, data, etc)
– Hybrid systems, accelerators, fit-for-purpose
– Adaptive, dynamic and specialized
– JITing for light weight scripting environments
[Figure: z/OS Multi-Threaded 64 bit Java Workload, 16-Way: ~12x Improvement in Hardware and Software. Normalized throughput (0 to 160) vs. threads (1 to 32). Series: zEC12 SDK 7 SR3 Aggressive + LP Code Cache; zEC12 SDK 7 SR1; z196 SDK 7 SR1; z196 SDK 6 SR8; z10 SDK 6 SR4; z10 SDK 6 GM NO (CR or Heap LP); z9 Java 5 SR5 NO (CR or Heap LP)]
zEC12 Runtime-Instrumentation: H/W for Managed Runtimes
[Figure: RI data flow in the J9 runtime
– Compilation thread: generates JITCODE from Java code
– Application thread: executes Java code in the J9 runtime (e.g., interpreter) or as JITCODE, bracketed by RION/RIOFF; each application thread has an RI buffer
– This buffer is filled every 20,000 cycles by the CPU
– An asynchronous event is sent every 2 ms to the sampling thread
– Sampling thread: calls processBuffer() (C code) to map binary addresses of JITCODE to bytecode and update per-method tables (Method A, Method B)
– Example table rows: binary address 0x0020, bytecode index 1, count 290; binary address 0x0080, bytecode index 20, count 3000]
• Highly configurable trace-sampling mechanism
– Events: data/instruction cache miss information, register values
– Paths: last N taken branches
– Correlated value, event and path profiling
• Integrating into IBM JVM
– Java Runtime Environment is designed to be adaptive and self-tuning
– RI enhances JRE decision-making by providing real-time feedback
RI Example: Path Splitting/Specialization
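[Figure: control-flow graph before and after compilation. Before: blocks B1, B2 (x=4), B3 (x=i), B4, B5 (j = y/x), B6 and B7. After: the compiler adds specialized copies B4' and B5' (j = y >> 2) for the path where x=4, while B4 and B5 (j = y/x) remain for the path where x=i]
Path can be specialized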
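The specialization in the diagram can be sketched in source form. This is only an illustration of the idea and not the JIT's actual transformation: the class and method names are hypothetical, and the boolean flag stands in for the split control flow. One caveat that the slide does not spell out is that in Java `y / 4` and `y >> 2` agree only for non-negative `y`, so the sketch assumes that `y >= 0` is known on the specialized path.

```java
// Hypothetical illustration of path splitting; names are not from the slides.
final class PathSpecialization {

    // General path (B3: x = i): the divisor is only known at run time.
    static int divideGeneral(int y, int x) {
        return y / x;
    }

    // Specialized path (B2: x = 4): the division becomes a shift.
    // Assumption: y >= 0. For negative y, '/' rounds toward zero while '>>'
    // rounds toward negative infinity, so a compiler must prove the sign
    // or emit a fix-up before using the shift.
    static int divideByFourSpecialized(int y) {
        return y >> 2;
    }

    // Stand-in for the split control flow: B4' -> B5' versus B4 -> B5.
    static int compute(int y, int i, boolean constantPath) {
        if (constantPath) {
            return divideByFourSpecialized(y);
        }
        return divideGeneral(y, i);
    }

    public static void main(String[] args) {
        System.out.println(compute(100, 4, true));   // specialized path
        System.out.println(compute(100, 5, false));  // general path
        // The two forms differ for negative operands:
        System.out.println((-7 / 4) + " vs " + (-7 >> 2));
    }
}
```

Profile data of the kind RI provides is what would tell the compiler that the x=4 path is hot enough to be worth duplicating.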
Binary Optimization
• Binary Optimization technology being built by IBM Research (Tokyo)
– Solution for the "Dusty Deck Problem": can re-optimize the existing large body of legacy code that customers are unable and/or unwilling to recompile
– Built on top of IBM Testarossa Optimizer
• Value
– Upgrade the binary from a really old ISA to the exact ISA of the machine the code is running on
– Inject the latest COBOL optimization technology into legacy programs
– Ability to start using profiling and RI-based feedback to optimize the program
• Experimental Results*
– up to 4.62x and average 1.89x over the original binary on z196
– up to 3.31x and average 1.94x over the original binary on zEC12
• Alpha-level prototype available on developerWorks
Technology that enables re-optimization of legacy COBOL binaries on the latest System z without requiring source-level re-compilation.
[Figure: legacy COBOL binary with ESA390 instructions -> Binary Optimizer -> optimized COBOL binary with zEC12 instructions]
* Measured using twelve benchmarks in the internal test suite used by the COBOL compiler dev. team.
Concurrency and Parallelism in the Enterprise
Traditionally a play in niche spaces (eg. HPC)
With industry focus on business intelligence, analytics and optimization, stakes have reached a new high
• Programming models for parallelism becoming mainstream
– Eg. java.util.concurrent, fork/join and Lambdas
• Traditional programming models evolving to meet the needs of enterprise computing
– Eg. OpenMP adds tasks
• Hardware transactional memory… it's here!
• Auto-parallel remains elusive!
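To make the fork/join model mentioned above concrete, here is a minimal sketch that sums an array by recursive splitting. The class name ParallelSum and the THRESHOLD value are illustrative choices and do not come from the talk. ForkJoinPool and RecursiveTask are the standard java.util.concurrent classes.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical divide-and-conquer task built on the standard fork/join API.
final class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // illustrative cut-off
    private final long[] data;
    private final int from;
    private final int to; // exclusive

    ParallelSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: sum sequentially.
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ParallelSum left = new ParallelSum(data, from, mid);
        ParallelSum right = new ParallelSum(data, mid, to);
        left.fork();                       // run the left half asynchronously
        long rightSum = right.compute();   // compute the right half in this thread
        return left.join() + rightSum;     // wait for the left half and combine
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new ParallelSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(sum(data));
    }
}
```

The programmer still has to decide how to split the work and where to stop splitting, which is one reason the slide calls automatic parallelism elusive.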
Transactional Execution: Concurrent Linked Queue
• ~2x improved scalability of juc.ConcurrentLinkedQueue
• Unbounded thread-safe linked queue
– First-in-first-out (FIFO)
  • Insert elements into tail (en-queue)
  • Poll elements from head (de-queue)
– No explicit locking required
• Example usage: a multi-threaded work queue
– Tasks are inserted into a concurrent linked queue as multiple worker threads poll work from it concurrently
[Figure: linked queue with head pointing at the first node (de-queue side) and tail pointing at the last node (en-queue side); chart comparing the new TX-based implementation with the traditional CAS-based implementation]
(Controlled measurement environment, results may vary)
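The work-queue usage described on this slide can be sketched with the standard java.util.concurrent.ConcurrentLinkedQueue. The class name WorkQueueDemo and the choice to pre-fill the queue before starting the workers are simplifications for this example: on the slide, producers keep inserting while workers poll. The sketch exercises only the public API and is the same whether the JVM implements the queue with CAS or with transactions underneath.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical work-queue example; tasks are just integers to be summed.
final class WorkQueueDemo {

    // Enqueue 1..taskCount, then let workerCount threads drain the queue.
    static long runWorkers(int taskCount, int workerCount) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 1; i <= taskCount; i++) {
            queue.offer(i); // en-queue at the tail
        }

        AtomicLong total = new AtomicLong();
        Thread[] workers = new Thread[workerCount];
        for (int w = 0; w < workerCount; w++) {
            workers[w] = new Thread(() -> {
                Integer task;
                // poll() de-queues from the head and returns null when empty;
                // no explicit lock is taken by the caller.
                while ((task = queue.poll()) != null) {
                    total.addAndGet(task);
                }
            });
            workers[w].start();
        }

        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(1000, 4));
    }
}
```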
HTM Example: Transactional Lock Elision (TLE)
[Figure: Transactional Lock Elision on HashTable.get(), Java prototype. Throughput (ops/sec) vs. threads (2 to 16)]
Threads must serialize despite only reading… just in-case a writer updates the hash
read_hash(key) {
    Wait_for_lock();
    read(hash, key);
    Release_lock();
}
[Figure: timeline in which Thr1, Thr2 and Thr3 each run read_hash() one after another; total time T]
Lock elision allows readers to execute in parallel, and safely back-out should a writer update hash
read_hash(key) {
    TRANSACTION_BEGIN
    read hash.lock;
    BRNE serialize_on_hash_lock
    read(hash, key);
    TRANSACTION_END
}
[Figure: timeline in which Thr1 … Thr3 run read_hash() in parallel; total time T']
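Java exposes no public API for the transactional instructions shown in the pseudocode; the JVM would emit them on the program's behalf. As a software analogue only, the sketch below uses an optimistic read on java.util.concurrent.locks.StampedLock (available from Java 8): readers proceed without acquiring the lock and fall back to a real read lock if a writer intervened. This is a different mechanism from hardware lock elision and is swapped in purely to show the "read speculatively, back out on conflict" shape. The class OptimisticCell and its single key/value slot are hypothetical.

```java
import java.util.concurrent.locks.StampedLock;

// Hypothetical single-slot table guarded by a StampedLock.
final class OptimisticCell {
    private final StampedLock lock = new StampedLock();
    private long key;
    private long value;

    // Writers always take the exclusive lock.
    void put(long newKey, long newValue) {
        long stamp = lock.writeLock();
        try {
            key = newKey;
            value = newValue;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    // Returns the stored value if the slot holds `wanted`, otherwise -1.
    long get(long wanted) {
        // Speculative phase: read without blocking other readers or writers.
        long stamp = lock.tryOptimisticRead();
        long k = key;
        long v = value;
        if (!lock.validate(stamp)) {
            // A writer interfered: back out and serialize on the lock,
            // analogous to the serialize_on_hash_lock branch in the pseudocode.
            stamp = lock.readLock();
            try {
                k = key;
                v = value;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return k == wanted ? v : -1;
    }

    public static void main(String[] args) {
        OptimisticCell cell = new OptimisticCell();
        cell.put(7, 42);
        System.out.println(cell.get(7));
        System.out.println(cell.get(8));
    }
}
```

With hardware lock elision the application keeps its ordinary lock-based code and the speculation is applied underneath it, whereas this analogue requires the code to be restructured by hand.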
Java8: Language Innovation – Lambdas and Parallelism
New syntax to allow concise code snippets and expressions
– Useful for sending code to java.util.concurrent
– On the path to enabling more parallelism
http://www.dzone.com/links/presentation_languagelibraryvm_coevolution_in_jav.html
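A small sketch of both points, using only standard Java 8 APIs: a lambda handed to an ExecutorService from java.util.concurrent, and a lambda used inside a parallel stream. The class and method names (LambdaDemo, submitLambda, sumOfSquaresParallel) are invented for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; all library calls are standard Java 8.
final class LambdaDemo {

    // "Sending code" to java.util.concurrent: the lambda is the task body.
    static int submitLambda(int a, int b) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<Integer> task = () -> a + b;
            Future<Integer> result = pool.submit(task);
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // A lambda inside a parallel stream: the library decides how to split the work.
    static int sumOfSquaresParallel(List<Integer> values) {
        return values.parallelStream().mapToInt(v -> v * v).sum();
    }

    public static void main(String[] args) {
        System.out.println(submitLambda(2, 3));
        System.out.println(sumOfSquaresParallel(Arrays.asList(1, 2, 3, 4)));
    }
}
```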
So what about auto-parallelism?
A MOUTHFUL to chew on:
• Will the convergence of analytics/optimization and enterprise in the context of the end of the single-thread-performance-roadmap on the cloud provide enough momentum/focus to see some real-world (eg. Java) breakthroughs in auto-parallel technology?
Concluding Remarks
Lots to be excited about
• Significant acceleration in innovation in the hardware/software interface
• It's never been a better time to be a runtime/compiler developer
© Copyright IBM Corporation 2013. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
Creating an Open Community to drive full stack innovation for the cloud: OpenPOWER
• Industry's first open system design for cloud data centers
• Custom development group for hyperscale servers including hardware designs, firmware and software
• Addresses need for industry-based innovation across processors, network and storage I/O
• OpenPOWER will create an ecosystem for Power Systems
– IBM will contribute OpenSource software
– IBM will enable industry participation through open documentation
– IBM will license chip design intellectual property (IP) to allow customization
OpenPOWER Consortium – IBM POWER server technology creates an open community to drive innovation for the Cloud
[Figure: OpenPOWER Consortium stack: Systems Management; Firmware, Hypervisors and Operating Systems; Systems; Processors; Semiconductor Technology]