Rewriting The Rules For Enterprise IT Enterprise Grid Orchestrator Christof Westhues SE Manager EMEA Platform Computing 2007/03/01 National Grid Meeting Ankara


Rewriting The Rules For Enterprise IT: Enterprise Grid Orchestrator

Christof Westhues SE Manager EMEA

Platform Computing 2007/03/01

National Grid Meeting Ankara


© Platform Computing Inc. 2003

Platform Enterprise Grid Orchestrator: boosting EU-Grid Technology exploitation

Agenda

Increasing the industrial impact of EU Grid Technologies Programme

About Platform Computing

Understanding Industry requirements

Unified Grid resource layer

Integrate your Grid solution with Platform EGO

Platform Collaborations – EGEE, DEISA etc.

Conclusion - Open for new ideas


Increasing the industrial impact of EU Grid Technologies Programme with Platform Enterprise Grid Orchestrator

The EU Grid Technologies Programme targets the logical next step: 'From Vision to Impacts in Industry and Society'

How to make this real?

Platform Computing holds probably the largest commercially productive install base of Grid infrastructure in industry worldwide.

Now introducing the Enterprise Grid Orchestrator (EGO), the first large scale rolled out Grid-SOI (Service Oriented Infrastructure) for technical as well as business computing.

Platform Computing EGO invites all Grid technology solutions to integrate with its unified Grid resource layer.


Platform Computing


The leading systems infrastructure software company accelerating applications and delivering I.T. agility to High Performance Data Centers

14 years of grid computing experience

Global network of offices, resellers & partners

7 x 24 world-wide support and consulting

Gartner Group 2006 “Cool Vendor” award in I.T. Operations Management


Over 2,000 leading Global Customers


Our Customers: from all verticals

Electronics
• AMD
• ARM
• ATI
• Broadcom
• Cadence
• HP
• IBM
• Motorola
• NVIDIA
• Qualcomm
• Samsung
• ST Micro
• Synopsys
• Texas Instr.
• Toshiba

Financial Services
• Fidelity Investments
• HSBC
• JP Morgan Chase
• Mass Mutual
• Royal Bank of Canada
• Sal Oppenheim
• Société Générale
• Lehman Brothers

Industrial Manufacturing
• BMW
• Boeing
• Bombardier
• Airbus
• Daimler Chrysler
• GE
• GM
• Lockheed Martin
• Pratt & Whitney
• Toyota
• Volkswagen

Life Sciences
• AstraZeneca
• Bristol-Myers Squibb
• Celera
• Dupont
• GSK
• Johnson & Johnson
• Merck
• Novartis
• Pfizer
• Wellcome Trust Sanger Institute
• Wyeth

Government & Research
• ASCI
• CERN
• DoD, US
• DoE, US
• ENEA
• Fleet Numeric
• Max Planck
• SSC, China
• TACC
• Univ Tokyo

Other Business
• Bell Canada
• Cablevision
• eBay
• Starwood Hotels
• Telecom Italia
• Telefonica
• Sprint
• GE
• IRI
• Cadbury Schweppes


Understanding Industry requirements


Grid value: shared resources & shared usage.

Unify many different users AND multiple different workload types

Avoid building “Grid-Silos”: don’t become part of the problem

Primary target is “agility” – speed & ease of change

Driven by business process & business change needs

As a consequence of handling all workload in the Grid, orchestration, scaling and acceleration result in agility

Let's have a look at the users

Industry, generically: professional users aiming to create results (€, $, ₤) using the tool “Grid”. Call them customers (change of perspective)


Quality requirements

Reliability (self-healing, recovery from incidents, policy-driven proactive problem containment, no job loss during operation, in error conditions, or during reconfiguration or failover)

Performance (n × 10 million jobs per day throughput at 90% job-slot utilization, based on a 15 min job runtime; max. 5 min for failover)

Scalability (n × 1000s of users and hosts, n × millions of jobs in one logical cluster at any time, n × 10 million jobs per day throughput, n × 1000s-way parallel jobs)
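To make the performance figure concrete, the arithmetic behind it can be sketched as follows. This is a back-of-the-envelope illustration only: the function name and the choice of n = 1 (10 million jobs per day) are assumptions, not figures from the slide.

```python
import math

MINUTES_PER_DAY = 24 * 60


def slots_needed(jobs_per_day, job_runtime_min=15, utilization=0.90):
    """Job slots needed to sustain a daily throughput.

    Each slot is busy for `utilization` of the day and completes one job
    every `job_runtime_min` minutes while it is busy.
    """
    jobs_per_slot_per_day = utilization * MINUTES_PER_DAY / job_runtime_min
    return math.ceil(jobs_per_day / jobs_per_slot_per_day)


# Assumed example: n = 1, i.e. 10 million jobs per day.
required = slots_needed(10_000_000)
print(required)
```

Under these assumptions a single slot completes about 86 jobs per day, so 10 million jobs per day corresponds to roughly 116,000 job slots.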


LSF Roadmap

LSF product roadmap is based on the feedback and interviews with 75+ customers including:

Agilent, Airbus, AMD, ARM, Apple, ATI, BASF, BMS, Boehringer Ingelheim, Boeing, Broadcom, Caltech, CEA, Cineca, Cinesite, Conexant, Daimler Chrysler, DEISA, Devon Energy, Disney, DoD (ARL, ASC, ERDC, MHPCC, NAVO), DoE (LANL, LLNL, Sandia), Dreamworks, Emulex, Engineous, Ferrari, Fleet Numerics, Ford, Freescale, GE, GM, Halliburton/Landmark, Harvard, Hilti, HP, IDT, Intel, J&J, LandRoverJaguar, Lockheed Martin, LSILogic, Magma, Merck, Motorola, MSC, MTU, NCAR, NCSA, Nissan, NOAA, Novartis, NovoNordisk, NVidia, Philips, Pratt & Whitney, Pfizer, PSA, Qlogic, Qualcomm, RBC, Renesas, Samsung, Sandisk, Seagate, Shell, Skyworks, Synopsys, TACC, TenorNetworks, TI, Toshiba and Volvo


Understanding Industry requirements

Quality requirements

Why scaling counts: Performance and Scalability translate into Reliability

Reliability can be measured as “MTBF” – Mean Transactions (= Jobs) Between Failure

Platform technology meets this requirement – Technology Leader

Support 24/7 around the globe

Non-Technical Quality requirements

Focus on Grid technology – commitment

Reliable partner: experienced, stable, profitable.


Unified Grid resource layer


Enterprise Grid Problem: workload characteristics

[Diagram: several HPC & Enterprise Applications with unpredictable, infinite demand sit on top of a Heterogeneous Enterprise Resource (Servers, Licenses, Data Storage, Network Bandwidth) with finite compute resources.]

Result: under-provisioning or over-provisioning


IT Architectures Are Still Statically Coupled and Silo’d

[Diagram: row of five Application boxes]

Core Applications in the Data Center

Unpredictable, Infinite Demand

With Multiple Engineering groups collaborating on multiple designs, core and business applications can consume vast amounts of computing resources

Finite Computing Resources

Applications are “siloed”, often procured out of different budgets at different times for different purposes


The Need is for Variable Resources to Meet Variable Business Demand

Data Center Business “Pain Points”

Results of statically coupled and silo’d infrastructure:

• Underutilized Resources: some server silos have insufficient capacity while there is excess capacity in others
• Difficulty meeting SLAs: it is difficult to meet application SLAs because resources may not be available when required
• Costly I.T. Environment: with application silos underutilized, excess capacity, cooling, space and power are required
• Complex: coordination of resources is complex, time-consuming and error-prone
• Unpredictable: hardware failures, outages or insufficient capacity make the environment unpredictable


Model architecture

[Diagram: row of five Application boxes]

Core Applications in the Data Center

Unpredictable Infinite Demand

Computing Resources are Finite

Create a Shared Pool of Computer Systems

Decouple Resources from Applications


Open & Decoupled Architecture: Platform Enterprise Grid Orchestrator

[Architecture diagram; recoverable labels, from top to bottom:]

Applications: LS, MDA, EDA, CAE, FSI, VMs, J2EE, DBs, ERP, CRM, BI

Application Workload Management, each component attached via API/CLI: Platform LSF, Platform LSF HPC, Platform Symphony, Platform VOVMO & ASE, 3rd Party Middleware Integration

Platform EGO SDK/API

System Resource Orchestration: Platform EGO Kernel (Allocate, Manage, Execute, Fail-over)

Platform EGO Standard Services: Portal Service, Logging Service, Deployment Service, Event Service, Service Director, Data Cache, SNMP, Security

Plug-ins: Resources Plug-ins and Infrastructure Plug-ins (Storage, License, e.g. Infiniband)

Grid Devices: Servers and Desktops; H/W running Solaris, AIX, Windows, Linux

Side labels: SOA, SOI


Example: Dynamic Resource Allocations – Live SOI

Platform EGO Foundation

Platform EGO responds to requests from consumers and allocates supply according to policy – Service Oriented Infrastructure

Resource allocation: min, max, conditions, resource req.

Dynamic response: resource re-allocation based on policies (=> SLAs) – “lend & borrow”

Dynamic response: acquisition of additional resources

[Diagram: host groups Linux 2.4, Linux 2.4 and Windows NT]
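The min/max allocation with “lend & borrow” described on this slide can be illustrated with a small sketch. It is a hypothetical policy written for illustration, not EGO's actual algorithm or API; the `Consumer` class, the `allocate` function and all numbers are assumptions, and the sketch assumes the guaranteed minimums together do not exceed the pool size.

```python
from dataclasses import dataclass


@dataclass
class Consumer:
    name: str
    minimum: int   # guaranteed slots (the SLA floor)
    maximum: int   # upper limit, including borrowed slots
    demand: int    # slots currently requested


def allocate(total_slots, consumers):
    """Return {name: slots} for one allocation round (illustrative policy)."""
    # Step 1: every consumer gets what it asks for, up to its guaranteed minimum.
    alloc = {c.name: min(c.demand, c.minimum) for c in consumers}
    free = total_slots - sum(alloc.values())
    # Step 2: unused guarantees are lent to consumers with unmet demand,
    # never beyond their maximum.
    for c in consumers:
        want = min(c.demand, c.maximum) - alloc[c.name]
        extra = max(0, min(want, free))
        alloc[c.name] += extra
        free -= extra
    return alloc


# B is mostly idle, so A borrows B's unused guarantee.
lending = allocate(100, [Consumer("A", 50, 80, 90), Consumer("B", 50, 100, 10)])
# B's demand rises; re-running the round returns the lent slots to B.
reclaimed = allocate(100, [Consumer("A", 50, 80, 90), Consumer("B", 50, 100, 60)])
print(lending, reclaimed)
```

Re-running the allocation whenever demand changes is what makes the shares “breathe”: borrowed slots flow back as soon as the lender needs its guaranteed minimum again.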


3rd party Middleware integration

Integrate your Grid solution with Platform EGO


Meet industrial quality requirements AND deploy innovative technologies and methods

Specific and targeted solutions as well as general purpose workload adapters can join one unified resource Grid

Reliability (self-healing, recovery from incidents, policy driven proactive problem containment)

Dynamic Resource Allocation – peak power on demand

Scalability & Performance


Integrate your Grid solution with Platform EGO

Platform EGO offers policy-based access to all resources in the Grid through an open API/SDK.

Access the same resource Grid from and for all workload types or Grid solutions. No Grid silos!

Access to resources on EGO includes dynamic allocations within SLA guarantees.

“Breathing” resource allocations: SLA minimum and maximum – lend & borrow

This may well replace traditional static Advanced Reservations, which were building up “virtual silos”: a virtual, grid-based flavor of the silo’ed infrastructure that Grid technology was supposed to make redundant. No Grid silos – not even virtual!


Platform Collaborations


Platform Engagements and Collaborations

Currently, Platform Computing is engaged at:

QOSCOS

DEISA

EGEE


Platform Collaborations - QOSCOS


What is QosCosGrid?

IST Proposal

Specific Targeted Research Project (STREP)

IST Call 5

FP6-2005-IST-5  

Quasi-Opportunistic Supercomputing for Complex Systems in Grid Environments

(QosCosGrid)

Part. # | Participant organisation name | Short name
1* | University of Ulster, United Kingdom | UU
2 | The University of Queensland, Australia | UQ
3 | Israel Institute of Technology, Israel | TECH
4 | Cranfield University, United Kingdom | CU
5 | Universitat Pompeu Fabra, Spain | UPF
6 | Eötvös Loránd University, Hungary | ELU
7 | National Inst. for Research in Computer Science and Control, France | INRIA
8 | Poznan Supercomputing and Networking Centre, Poland | PSNC
9 | University of Amsterdam, Netherlands | UA
10 | Platform Computing | PCC


Target & Definition, from the proposal paper:

…. “Whereas supercomputing resources are more or less dependable, the grid approach is characterized by an opportunistic sharing of resources as they become available. This distributed quasi-opportunistic supercomputing, while not offering the quality of service of a supercomputer, will be to some degree better than the pure opportunistic grid approach. Furthermore it will enable users to develop applications with supercomputing requirements without the need to deploy supercomputers themselves. …

QosCosGrid is, therefore, an effort to use the best from two worlds: the opportunistic approach of the grid technology to sharing and using resources whenever they become available, and the reliant or dependable approach of the supercomputing. By developing an infrastructure for quasi-opportunistic supercomputing, QosCosGrid aims at providing a reliable, effortless and cost-effective access to the enormous computational and storage resources required across a wide range of CS research areas and application domains and industrial sectors.”

Prof. Dr. Dubitzki, University of Ulster


The Proposal to the EU-Commission

Why Platform Computing?

Researchers from the initiating University of Ulster remembered Platform Computing from D-Grid (German e-science initiative) working groups and asked for Platform's participation

EU-Commission funding rule: for each research project there must be a commercial partner

Platform is invited to enter the academic IT research scene in Europe and thereby increase its success in a currently underdeveloped market

Platform was offered a package of 45 person-months with total funding of more than 400,000 euros


QosCosGrid Project Plan

Person-months (PMs) per work package and task. Partner columns follow the partner numbers 1 to 10; the column after the task name names the partner shown next to each task.

Work packages and tasks | Lead | PMs | UU | UQ | TECH | CU | UPF | ELU | INRIA | PSNC | UA | PCC
WP1 Grid Services for CS Simulations | PSNC | 159 | 1 | 45 | 11 | 3 | 19 | 6 | 22 | 29 | 14 | 9
T1.1 State-of-the-art/gap analysis of grid technologies for CS modelling | UQ | 28 | 0 | 15 | 1 | 0 | 4 | 0 | 2 | 3 | 2 | 1
T1.2 Selection and adaptation of grid monitoring and meta-scheduling services to specific requirements of CS modelling | PSNC | 32 | 0 | 8 | 4 | 0 | 4 | 0 | 0 | 9 | 6 | 1
T1.3 Design and implementation of fault tolerance protocols for point to point and grid middleware aware CS communication routines | INRIA | 22 | 0 | 0 | 1 | 0 | 0 | 0 | 20 | 0 | 0 | 1
T1.4 Design and development of CS oriented interfaces to grid services | UPF | 21 | 1 | 8 | 2 | 0 | 4 | 2 | 0 | 3 | 0 | 1
T1.5 Adaptation and integration of advanced features provided by local scheduling and storage systems according to specific CS requirements | PCC | 16 | 0 | 0 | 2 | 0 | 4 | 0 | 0 | 6 | 0 | 4
T1.6 Integration of storage management and data transfer systems with CS oriented grid services and interfaces | PSNC | 20 | 0 | 0 | 1 | 2 | 2 | 2 | 0 | 6 | 6 | 1
T1.7 Remote steering of grid middleware aware CS simulations | UQ | 20 | 0 | 14 | 0 | 1 | 1 | 2 | 0 | 2 | 0 | 0
WP2 Grid Services for QO Supercomputing | TECH | 125 | 1 | 10 | 70 | 1 | 5 | 10 | 0 | 10 | 0 | 18
T2.1 State-of-the-art/gap analysis of QO supercomputing | TECH | 19 | 0 | 5 | 7 | 0 | 0 | 3 | 0 | 3 | 0 | 1
T2.2 Multi Resource QoS-aware Provisioning | TECH | 26 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 5
T2.3 QoS aware resource orchestration | TECH | 31 | 0 | 0 | 21 | 0 | 2 | 3 | 0 | 3 | 0 | 2
T2.4 QoS resource orchestration for a multi-applications Grid environment | TECH | 32 | 1 | 0 | 21 | 0 | 1 | 3 | 0 | 3 | 0 | 3
T2.5 Accounting and billing services | PCC | 17 | 0 | 5 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 7
WP3 CS Simulations on the Grid (use case scenarios) | UPF | 166 | 22 | 23 | 6 | 27 | 17 | 27 | 9 | 4 | 29 | 2
T3.1 Use cases requirements analysis and specification | UA | 29 | 3 | 5 | 1 | 3 | 4 | 5 | 1 | 3 | 3 | 1
T3.2 Living simulations (protein folding, astrophysics apps) | UA | 46 | 6 | 5 | 0 | 2 | 4 | 5 | 1 | 0 | 23 | 0
T3.3 Evolutionary computation | ELU | 36 | 10 | 5 | 0 | 6 | 4 | 10 | 1 | 0 | 0 | 0
T3.4 Co-evolutionary agents models (supply chain) | CU | 28 | 0 | 5 | 0 | 12 | 4 | 6 | 1 | 0 | 0 | 0
T3.5 Integration with WP1 and WP2 and demonstrations | UU | 27 | 3 | 3 | 5 | 4 | 1 | 1 | 5 | 1 | 3 | 1
WP4 Concertation | CU | 12 | 1 | 0 | 1 | 3 | 1 | 1 | 2 | 1 | 1 | 1
T4.1 Exploitation of synergies / technical exploitation | PCC | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1
T4.2 Joint fora for exchange and dissemination | INRIA | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0
T4.3 Co-ordination of standardisation efforts | TECH | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0
T4.4 Repository of reference implementations and grid middleware | CU | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
T4.5 Collaboration on research inventors and roadmaps | UU | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
T4.6 Indicators and impact assessment | ELU | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
T4.7 Training activities | UA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
WP5 Exploitation and Dissemination | PCC | 34 | 5 | 1 | 1 | 2 | 1 | 1 | 6 | 1 | 1 | 15
T5.1 Promotional and dissemination activities | UU | 16 | 5 | 0 | 0 | 2 | 1 | 0 | 5 | 0 | 0 | 3
T5.2 Targeted user groups & exploitation | PCC | 18 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 12
WP6 Project Management | UU | 31 | 20 | 0 | 6 | 0 | 2 | 0 | 3 | 0 | 0 | 0
T6.1 Overall project coordination | UU | 21 | 20 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
T6.2 Technical coordination | TECH | 8 | 0 | 0 | 6 | 0 | 0 | 0 | 2 | 0 | 0 | 0
T6.3 Quality management | UPF | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0
Grand total | | 527 | 50 | 79 | 95 | 36 | 45 | 45 | 42 | 45 | 45 | 45
Planned in A4 1 | | | 9% | 15% | ### | 7% | 9% | 9% | 8% | 9% | 9% | 9%

[Gantt chart: project years and quarters (Year 1 to Year 3, Q1 to Q10); the bars are not recoverable from the extraction.]

Platform (PCC) marked

30 months runtime
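As a cross-check of the plan against the 45 person-months quoted for Platform, the PCC column can be summed per work package. The short script below copies the PCC values from the task rows of the plan; the variable names are illustrative.

```python
# PCC column of the project plan, one list of task values per work package.
pcc_task_pms = {
    "WP1": [1, 1, 1, 1, 4, 1, 0],
    "WP2": [1, 5, 2, 3, 7],
    "WP3": [1, 0, 0, 0, 1],
    "WP4": [1, 0, 0, 0, 0, 0, 0],
    "WP5": [3, 12],
    "WP6": [0, 0, 0],
}
# Work-package subtotals as printed in the PCC column.
pcc_wp_totals = {"WP1": 9, "WP2": 18, "WP3": 2, "WP4": 1, "WP5": 15, "WP6": 0}

# Every work-package subtotal must equal the sum of its task rows.
for wp, tasks in pcc_task_pms.items():
    assert sum(tasks) == pcc_wp_totals[wp], wp

pcc_total = sum(pcc_wp_totals.values())
print(pcc_total)
```

The task rows add up to the printed subtotals, and the subtotals add up to the 45 person-months in the grand-total row, with the largest shares in WP2 (18) and WP5 (15).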


QosCosGrid Technology Stack & LSF

QosCosGrid Technology Stack: QosCosGrid research and development efforts will be based on the existing grid technology (such as GT4[[i]], Glite[[ii]] and LSF[[iii]] from PCC), and will focus on three additional layers, as depicted in Figure below.

To achieve that, one of the first activities in the project will be the roll-out of a world-spanning Platform LSF-MultiCluster grid – from Ireland across Europe, Israel and Australia.

[Figure: QosCosGrid technology stack with the layers Grid Fabric, Middleware, “Interfaces, Services, Tools” and Applications (Demo 1, Demo 2, Demo 3, … Demo N)]

[[i]] GT4: www.globus.org/toolkit
[[ii]] Glite: glite.web.cern.ch/Glite
[[iii]] LSF: www.platform.com/Products


Platform Collaborations - DEISA


Heterogeneous job submissions and Co-Allocation capability

[Diagram: Virtualized Infrastructure spanning OpenPBS / PBSPro, IBM LoadLeveler, Platform LSF and NEC NQS (optional)]

Develop and extend heterogeneous job submission capability (UNIVERSUS)


Co-Allocation: Heterogeneous Multi-Site resource allocation

Example: “Give me 200 CPUs on Site1 and 300 CPUs on Site2 at the same time”
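The co-allocation request above amounts to finding a time window in which every site can satisfy its part of the request simultaneously. The sketch below shows one simple way to search for such a window; the function, the slot-based availability tables and all numbers are assumptions made for illustration, not the DEISA or LSF implementation.

```python
def earliest_common_start(free_cpus, request, duration):
    """Find the first time slot where all sites can serve the request.

    free_cpus: {site: [free CPUs per time slot]}
    request:   {site: CPUs needed on that site}
    Returns the first slot index at which every site has enough free CPUs
    for `duration` consecutive slots, or None if there is no such window.
    """
    horizon = min(len(free_cpus[site]) for site in request)
    for start in range(horizon - duration + 1):
        if all(
            free_cpus[site][t] >= cpus
            for site, cpus in request.items()
            for t in range(start, start + duration)
        ):
            return start
    return None


# Hypothetical availability forecast per site and time slot.
free = {
    "Site1": [100, 250, 250, 250, 50],
    "Site2": [400, 100, 300, 300, 300],
}
request = {"Site1": 200, "Site2": 300}
start = earliest_common_start(free, request, duration=2)
print(start)
```

In this example neither site alone is the bottleneck throughout: Site1 blocks slot 0 and Site2 blocks slot 1, so the first window that works for both begins at slot 2.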


Platform Collaborations - EGEE


Platform Computing - EGEE-Business-Associate

The collaboration “Plan” Step 1

Immediate improvements for the EGEE users and resource providers

Technology boost

SLA Scheduling

Parallel job control and accounting

Resource aware scheduling – double compute efficiency

What's next? Step 2

Mid-term target: production Grid unifying all resources AND all users

Enable & integrate with new user groups and their resources

All kinds of applications: commercial code; complex systems

Long-term target: SOA/SOI for Service Oriented Science

“IT agility” for scientific computing

Introduce novelties faster

respond to changing requests in time


EGEE & Platform: the “Plan”

The collaboration “Plan” Step 1: 4 Actions

1st Action: improve LSF/gLite integration

Platform LSF is one of the supported batch systems of gLite.

Currently, about 45% of all CPUs in EGEE are on LSF

May include version maintenance as well as performance improvements

Will include improved documentation and communication

Leads to a better understanding of the capabilities of LSF, in order to build complex algorithms that may benefit from information passing to use all the features of LSF


2nd Action: SLA Scheduling

Exploit LSF and gLite features to enhance user and resource provider capabilities

SLA scheduling helps both:

For the user, it provides guaranteed result delivery, in time or in throughput

For the resource provider, it translates to “least impact scheduling”, that is: serving the SLA user while there is still room left to host other requests. In other words: handling different service levels, working with different customers, at the same time

Expected results:

Resource providers will offer more resources to EGEE users under well-defined SLAs

Users perceive predictable result delivery and predictable behaviour of the Grid


3rd Action: Parallel application support

gLite today supports sequential jobs and provides basic support for parallel jobs based on mpich

Exploit LSF-HPC features

LSF-HPC allows control of MPI parallel jobs down to task level

Provides a signalling layer for management or workflow control signals

Delivers accounting that includes all children of a parallel application

Supports multiple MPI types in one cluster

Is parallel application support in EGEE easy? No.

LSF-HPC might be the best choice to start with.

We may identify topics worth a research project / support action

E.g.: parallel application checkpoint / restart


4th Action: Resource aware scheduling – double compute efficiency

Exploit LSF features

LSF supports a generic resource concept, thus data is a resource, too

All resources can be used for scheduling decisions

Scheduling paradigm “job-follows-data” results in up to 50% gain in compute power

Is resource aware scheduling in EGEE easy? No.

EGEE supports co-location of data and computation based on sites, but not for computation scheduling within a site

Major topics in operations model

Medium topics for the compute resources: re-think, re-build, re-budget

Maybe switch to mid-term horizon …
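The “job-follows-data” idea, treating data location as one more schedulable resource, can be sketched in a few lines. This is an illustration of the paradigm only: the host records, the dataset names and the `pick_host` function are hypothetical and do not represent LSF's resource requirement syntax or scheduler.

```python
def pick_host(job_dataset, hosts):
    """Prefer a free host that already holds the job's dataset.

    hosts: list of dicts with 'name', 'free_slots' and 'datasets' (a set).
    Returns (host name, True if the data is local), or (None, False)
    when no host has a free slot.
    """
    candidates = [h for h in hosts if h["free_slots"] > 0]
    for h in candidates:
        if job_dataset in h["datasets"]:
            return h["name"], True
    if candidates:
        # No free host holds the data: run anyway, but the data must be staged in.
        return candidates[0]["name"], False
    return None, False


hosts = [
    {"name": "node1", "free_slots": 2, "datasets": {"run41"}},
    {"name": "node2", "free_slots": 1, "datasets": {"run42"}},
    {"name": "node3", "free_slots": 0, "datasets": {"run43"}},
]
local_choice = pick_host("run42", hosts)
remote_choice = pick_host("run43", hosts)
print(local_choice, remote_choice)
```

The second call shows the case the slide warns about within a site: the only host holding the data is busy, so the job either waits or pays for a transfer, which is where the efficiency gain of data-aware placement comes from.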


SLA Scheduling for EGEE

LSF service-level agreement (SLA) scheduling:

Is a goal-oriented "just-in-time" scheduling policy that enables the user to focus on the "what and when" of a project instead of "how" the resources need to be allocated to satisfy various workloads

Defines an agreement between LSF administrators and users

Helps configure workload so that jobs complete on time

Reduces the risk of missed deadlines

Three different types of service-level goals are available: Deadline, Velocity and Throughput, or a combination of these service-level goals


SLA Scheduling for EGEE: SLA “Deadline”

[Diagram: two timelines of a cluster in which 100% = 8 job slots. Classical opportunistic scheduling: the cluster is filled to 100% from now on (“I need to work now!”). SLA “deadline” scheduling: SLA 1 consumes 50% of the cluster until the deadline, which is “early enough for me”, leaving free resources for dialog users, real-time requests, online sessions and other workload.]


SLA Scheduling for EGEE: SLA “Throughput”

[Diagram: SLA 2 (“throughput”) consumes 25% of the cluster and delivers 4 results/hr in each interval over time. The rest of the cluster stays free for dialog users, real-time requests, online sessions, other workload, other SLAs, …: more EGEE users!]

“I am a scientist, I need just as many results as I can process per time interval.”
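The cluster shares shown on the two SLA slides (50% for the deadline goal, 25% for the throughput goal) follow from simple sizing arithmetic. The sketch below is one possible way to compute them; the cluster size of 8 job slots and the rate of 4 results per hour come from the slides, while the job counts, runtimes and function names are assumptions chosen to reproduce those shares.

```python
import math

CLUSTER_SLOTS = 8  # "100% = 8 job slots" on the deadline slide


def slots_for_deadline(jobs_remaining, job_runtime_h, hours_to_deadline):
    """Minimum number of slots that finishes all jobs by the deadline."""
    # Each slot can run this many jobs back to back before the deadline.
    jobs_per_slot = math.floor(hours_to_deadline / job_runtime_h)
    if jobs_per_slot == 0:
        raise ValueError("deadline is shorter than one job runtime")
    return math.ceil(jobs_remaining / jobs_per_slot)


def slots_for_throughput(results_per_hour, job_runtime_h):
    """Slots that must stay busy to sustain the target result rate."""
    return math.ceil(results_per_hour * job_runtime_h)


# Assumed workload: 16 one-hour jobs due in 4 hours.
deadline_slots = slots_for_deadline(16, 1.0, 4.0)
# Assumed workload: 30-minute jobs at 4 results per hour.
throughput_slots = slots_for_throughput(4, 0.5)

print(deadline_slots / CLUSTER_SLOTS)    # share of the cluster for SLA 1
print(throughput_slots / CLUSTER_SLOTS)  # share of the cluster for SLA 2
```

Allocating only the slots each goal needs, rather than filling the cluster, is what leaves room for interactive users and other SLAs.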


EGEE High Performance Parallel Computing

Distributed computation

“Imperfectly parallel” – the real world

Inter-task runtime communication

Often implemented using MPI – Message Passing Interface

MPI – “Many Possible Implementations”

Different communication patterns:

“Neighbour” tasks (defined by problem decomposition topology)

“All to all”, “some to many” (=N-to-M)

Central instance to tasks (commercial code, …)


LSF-HPC – LSF for High Performance Computing

LSF-HPC: LSF plus additional functionality

Topology aware scheduling

large SMPs

large Clusters

Task granular control for parallel computation

Generic and vendor specific MPI integrations

Signal forwarding to all tasks

Resource usage accounting for all tasks

Limit enforcement: time, mem, threads, ….

Scalability: +8000 in LSF6.2 / +16000 in LSF7.0


Platform LSF/HPC – Generic integration

Without the generic PJL framework, the PJL starts tasks directly on each host, and manages the job.

Even if the MPI job was submitted through LSF, LSF never receives information about the individual tasks. LSF is not able to track job resource usage or provide job control.

If you simply replace PAM with a parallel job launcher that is not integrated with LSF, LSF loses control of the process and is not able to monitor job resource usage or provide job control. LSF never receives information about the individual tasks.

[Diagram. Architecture: running a parallel job using a non-integrated PJL. The PJL on the 1st execution host starts the tasks on the 1st and 2nd execution hosts directly.]


PAM is the resource manager for the job.

The key step in the integration is to place TS in the job startup hierarchy, just before the task starts.

TS must be the parent process of each task in order to collect the task process ID (PID) and pass it to PAM.

Figure: Architecture using the generic PJL framework. Elements shown: job submission to mbatchd and mbschd on the LSF master host; sbatchd, RES, mpirun.lsf, PAM, PJL wrapper and PJL on the 1st execution host; RES on the 2nd execution host; on both execution hosts, TS instances that each start a task.
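
The role of TS described above can be illustrated with a minimal Python sketch. It is not the LSF implementation: `start_task` and the `report_pid` callback are hypothetical names, and the callback merely stands in for whatever channel passes the PID to PAM. The sketch only shows why the starter has to be the parent process: the parent is the one that learns the child's PID and exit status.

```python
import subprocess
import sys


def start_task(argv, report_pid):
    """Start one task as a direct child and report its PID before waiting.

    report_pid is a hypothetical stand-in for the channel to PAM.
    """
    child = subprocess.Popen(argv)   # the starter becomes the parent of the task
    report_pid(child.pid)            # hand the PID to the resource manager
    return child.wait()              # return the task's exit status


if __name__ == "__main__":
    seen = []
    status = start_task([sys.executable, "-c", "pass"], seen.append)
    print("task pid:", seen[0], "exit status:", status)
```

With the PID known, a resource manager can account for resource usage, forward signals and enforce limits per task, which is what a non-integrated launcher hides from LSF.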

Page 47: Rewriting The Rules For Enterprise IT Enterprise Grid Orchestrator Christof Westhues SE Manager EMEA Platform Computing 2007/03/01 National Grid Meeting

LSF-HPC – LSF for High Performance Computing

Advantage for EGEE, users and resource providers

Freedom to integrate and use

All MPI types

All compute architectures

May implement optional automated MPI selection, dependent on actual availability – best possible choice

Full application control, ready to implement optional parallel:

Preemption - important to guarantee service levels

Suspend/resume

Checkpoint/migrate/restart
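
The optional automated MPI selection listed above could look roughly like the following Python sketch. It is an assumption about one possible approach, not a Platform feature description: `select_launcher` is a hypothetical helper, and "availability" is reduced to whether a launcher executable is found on the `PATH`.

```python
import shutil


def select_launcher(preferred, is_available=shutil.which):
    """Return the first launcher in preference order that is available, else None."""
    for name in preferred:
        if is_available(name):
            return name
    return None


if __name__ == "__main__":
    # The preference order is an illustrative assumption.
    print(select_launcher(["mpirun", "mpiexec"]))
```

Passing the availability check as a parameter keeps the preference policy separate from how availability is determined on a given site.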

Page 48: Rewriting The Rules For Enterprise IT Enterprise Grid Orchestrator Christof Westhues SE Manager EMEA Platform Computing 2007/03/01 National Grid Meeting

Resource aware scheduling for EGEE

The collaboration “Plan” Step 1: 4 Actions

4th Action: Resource aware scheduling – double compute efficiency

Exploit LSF features

LSF supports a generic resource concept, thus data is a resource, too

All resources can be used for scheduling decisions

Scheduling paradigm “job-follows-data” results in up to 50% gain in compute power

Is resource aware scheduling in EGEE easy? No.

EGEE supports co-location of data and computation based on sites, but not for computation scheduling within a site

Major topics in operations model

Medium topics for the compute resources: re-think, re-build, re-budget

Maybe switch to mid-term horizon …

Page 49: Rewriting The Rules For Enterprise IT Enterprise Grid Orchestrator Christof Westhues SE Manager EMEA Platform Computing 2007/03/01 National Grid Meeting

EGEE: data handling in the resource center

Figure: EGEE data handling in the resource center. Elements shown: EGEE users, EGEE jobs, compute nodes, data nodes, LAN, and storage with a controller, eight drives and a robot.

EGEE example operations model

Job arrives and is started on compute node

Requested data is ordered from storage robot

Tape mounted and content “data set” provided to compute node via NFS

Result: 2 nodes are allocated for 1 job
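
The cost of this operations model can be made concrete with a little arithmetic. The Python sketch below assumes a pool of 8 interchangeable nodes (the pool size is an illustrative assumption) and compares one job per two nodes with one job per node.

```python
def concurrent_jobs(nodes, nodes_per_job):
    """Number of jobs that can run at once when each job occupies nodes_per_job nodes."""
    return nodes // nodes_per_job


if __name__ == "__main__":
    separate = concurrent_jobs(8, 2)   # one compute node plus one data node per job
    combined = concurrent_jobs(8, 1)   # job runs on the node that holds its data
    print(separate, combined, combined / separate)
```

Under these assumptions the same pool runs twice as many jobs at once when a job no longer ties up a separate data node.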

Page 50: Rewriting The Rules For Enterprise IT Enterprise Grid Orchestrator Christof Westhues SE Manager EMEA Platform Computing 2007/03/01 National Grid Meeting

Resource aware scheduling – up to double compute efficiency

Resource aware scheduling

1 Job arrives and is queued, resource requirement e.g. “data=#4711”

2 Requested data-set “#4711” is ordered from storage robot by LSF

3 Tape mounted

4 LSF resource “data” is updated to “data=#4711”

5 As soon as resource requirements are satisfied, job is dispatched to the right host, holding the right data locally

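Figure: Resource aware scheduling in an LSF cluster. Elements shown: EGEE users, EGEE jobs, queues (Q), LSF-mbschd, the resource “data” with value “identifier”, combined compute and data nodes (each marked R), and storage with a controller, eight drives and a robot. Markers 1 to 5 indicate where each step takes place.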
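
The “job-follows-data” flow on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not LSF code: `dispatch`, the host names and the second data-set identifier “#0815” are hypothetical, and a host is reduced to the single data set it currently holds.

```python
from collections import deque


def dispatch(queue, host_data):
    """Dispatch queued jobs whose data requirement is met by some free host.

    queue: deque of (job_id, dataset) in arrival order.
    host_data: dict mapping host -> dataset held locally (or None).
    Returns a list of (job_id, host); unmatched jobs stay queued in order.
    """
    placed, waiting, busy = [], deque(), set()
    while queue:
        job_id, dataset = queue.popleft()
        host = next((h for h, d in host_data.items()
                     if d == dataset and h not in busy), None)
        if host is None:
            waiting.append((job_id, dataset))   # requirement not yet satisfied
        else:
            busy.add(host)
            placed.append((job_id, host))
    queue.extend(waiting)
    return placed


if __name__ == "__main__":
    jobs = deque([("job1", "#4711"), ("job2", "#0815")])
    hosts = {"nodeA": None, "nodeB": None}
    print(dispatch(jobs, hosts))    # nothing is mounted yet, both jobs wait
    hosts["nodeB"] = "#4711"        # tape mounted, resource "data" updated
    print(dispatch(jobs, hosts))    # job1 follows its data to nodeB
```

The job waits in the queue until some host reports the requested data set, and only then is it placed, on exactly that host.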

Page 51: Rewriting The Rules For Enterprise IT Enterprise Grid Orchestrator Christof Westhues SE Manager EMEA Platform Computing 2007/03/01 National Grid Meeting

Conclusion

Page 52: Rewriting The Rules For Enterprise IT Enterprise Grid Orchestrator Christof Westhues SE Manager EMEA Platform Computing 2007/03/01 National Grid Meeting

Conclusion: Increasing the industrial impact of Grid

Increasing the industrial impact of EU Grid Technologies Programme with Platform Enterprise Grid Orchestrator

Platform Computing invites all Grid technology solutions to integrate with its unified Grid resource layer, the Enterprise Grid Orchestrator (EGO).

Platform Computing is open to partner with academia, research and industry to push forward adoption and “impact” of Grid technology.

Contact: Christof Westhues, SE Manager EMEA Platform Computing GmbH, [email protected]

Proline Bilişim A.Ş.

Tel: +90 212 236 8070

Fax: +90 212 236 7740

Page 53: Rewriting The Rules For Enterprise IT Enterprise Grid Orchestrator Christof Westhues SE Manager EMEA Platform Computing 2007/03/01 National Grid Meeting

Thank you