10/9/2013
1
CS 655 – Advanced Topics in Distributed Systems
Computer Science Department
Colorado State University
Presented by : Walid Budgaga
1
Outline
Condor
The Anatomy of the Grid
Globus Toolkit
2
Motivation
High Throughput Computing (HTC)?
Large amounts of computing capacity over long periods of time
Measured in operations per month or per year
High Performance Computing (HPC)?
Large amounts of computing capacity for short periods of time
Measured in FLOPS
3
Motivation
HTC is suitable for scientific research
Example (parameter sweep):
Testing parameter combinations to keep temperature at a particular level
op(x, y, z) takes 10 hours, 500 MB memory, 100 MB I/O
x(100), y(50), z(25) => 100 x 50 x 25 = 125,000 runs (~143 years of serial compute)
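The sweep size above can be checked with a few lines (a sketch; the parameter counts and the 10-hour runtime come from the slide, and the year conversion assumes one machine running around the clock):

```python
# Each (x, y, z) combination is one job in the parameter sweep.
x_vals, y_vals, z_vals = 100, 50, 25
jobs = x_vals * y_vals * z_vals            # 125,000 runs
hours_per_job = 10
serial_hours = jobs * hours_per_job        # 1,250,000 CPU-hours
serial_years = serial_hours / (24 * 365)   # ~143 years on a single machine
print(jobs, serial_years)
```

Spread across a pool of a few hundred otherwise-idle workstations, the same sweep finishes in months rather than centuries, which is exactly the HTC argument.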
4
Motivation
The Fort Collins Science Center uses Condor for scientific projects
Source: http://www.fort.usgs.gov/Condor/ComputingTimes.asp
5
HTC Environment
Large amounts of processing capacity?
Exploiting computers on the network
Utilizing heterogeneous resources
Overcoming platform differences by building a portable solution
Including a resource management framework
Over long periods of time?
The system must be reliable and maintainable
Surviving failures (software & hardware)
Allowing resources to leave and join at any time
Upgrading and configuring without significant downtime
6
HTC Environment
Also, the system must meet the needs of:
Resource owners
Rights respected
Policies enforced
Customers
The benefit of additional processing capacity must outweigh the complexity of usage
System administrators
The real benefit provided to users must outweigh the maintenance cost
7
HTC
Other considerations:
Distributively owned resources lead to:
Decentralized maintenance and configuration of resources
Varying resource availability
Applications may be preempted at any time
This adds an additional degree of resource heterogeneity
8
9
Condor Overview
Open-source high-throughput computing framework for compute-intensive tasks
Manages distributively owned resources to provide large amounts of capacity
Developed at the Computer Sciences Department at the University of Wisconsin-Madison
Name changed to HTCondor in October 2012
10
Condor Overview
11
Condor Overview
12
Customer agent
Represents the customer's job (application)
Can state its requirements, for example:
Needs a Linux/x86 platform
Wants the machine with the highest memory capacity
Prefers a machine in lab 120
Condor Overview
13
Resource agent
Represents the resource
Can state its offers, for example:
Platform: Linux/x86
Memory: 1 GB
Can state its requirements, for example:
Run jobs only when the keyboard and mouse have been idle for 15 minutes
Run jobs only from the computer science department
Never run jobs belonging to abc@cs.colostate.edu
Condor Overview
14
Matchmaker
Matches jobs and resources based on requirements and offers
Notifies the agents when a match is found
Challenges of an HTC system:
Software Development
System Administration
Software Development
Four primary challenges:
Utilization of heterogeneous resources
Requires system portability
Network protocol flexibility
Required to cope with constantly changing resource and customer needs
Required for adding new features
Remote file access
Required so applications can access their data from any workstation
Utilization of non-dedicated resources
Required for preempting and resuming applications
17
Software Development
Utilization of heterogeneous resources:
Requires system portability, obtained through a layered system design
• Network API:
• Connection-oriented and connectionless
• Reliable and unreliable interfaces
• Authentication and encryption
• Process management API:
• Create, suspend, resume, and kill a process
• Workstation statistics API:
• Reports information needed to implement resource owner policies and to verify applications' requirements
18
Software Development
Network protocol flexibility:
To cope with adding new services without frequently updating HTC components, a general-purpose data format may be used
• For example, Condor uses a protocol similar to RPC
19
Software Development
Remote file access (1):
To guarantee that HTC applications can access their data from any workstation in the cluster
• Three possible solutions:
• Using an existing distributed file system (e.g., NFS)
• Authenticates the customer application
• Privileges need to be assigned, or
• File access permission granted
20
Software Development
Remote file access (2):
• Implementing data file staging
• Transferring input and output files to the remote workstation specified by the customer
• Requires free disk space on the workstation
• High cost for large data files
21
Software Development
Remote file access (3):
• Redirecting file I/O system calls
• Interposing HTC between the application & the operating system
• By linking the application with an interposition library
• Does not require file storage on the remote workstation
• Reduces performance
• A portable interposition library is difficult to develop & maintain
22
23
Software Development
Utilization of non-dedicated resources
Requires the ability to preempt and resume applications
This can be obtained using checkpoints
Checkpoint:
A snapshot of the state of the executing program
It can be used to restart the program at a later time
Provides reliability
Enables preemptive-resume scheduling
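The preempt-and-resume idea can be shown in miniature. This is only a toy sketch that saves an explicit state dictionary with pickle; Condor's checkpointing library instead snapshots the entire process image transparently:

```python
import os
import pickle

CKPT = "loop.ckpt"  # hypothetical checkpoint file name

def run(total_steps):
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        state["acc"] += state["step"]      # the actual "work"
        state["step"] += 1
        if state["step"] % 100 == 0:       # periodic checkpoint
            with open(CKPT, "wb") as f:
                pickle.dump(state, f)

    if os.path.exists(CKPT):               # finished: discard the checkpoint
        os.remove(CKPT)
    return state["acc"]
```

If the process is killed mid-run, rerunning `run` restarts from the last saved step instead of from zero, which is the preemptive-resume behavior described above.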
24
Software Development
Checkpoints in Condor (1)
Used as a migration mechanism
The job scheduler uses them to migrate jobs between workstations
Used to resume a vacated job
The program has the ability to checkpoint itself
Using a checkpointing library
To provide additional reliability
HTCondor can be configured to write checkpoints periodically
25
Software Development
Checkpoints in Condor (2)
When checkpoints are stored:
Periodically, if HTCondor is so configured
At any time, by the program itself
When a higher-priority job has to start on the same machine
When the machine becomes busy
26
Software Development
Checkpoints in Condor (3)
Storing of checkpoints
By default, checkpoints are stored on the local disk of the machine where the job was submitted
However, Condor can be configured to store them on a checkpoint server
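As a sketch of that configuration option, HTCondor exposes checkpoint-server settings in its configuration files; the host name below is hypothetical:

```text
# condor_config fragment (illustrative host name)
USE_CKPT_SERVER  = TRUE
CKPT_SERVER_HOST = ckpt.cs.colostate.edu
```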
System Administration
The administrator has to answer to:
Resource owners
By guaranteeing that HTC enforces their policies
Customers
By ensuring they receive valuable services from HTC
Policy makers
By demonstrating that HTC is meeting the stated goals
29
System Administration
Access Policies
Specify when and how the resources can be accessed, and by whom
The policies might be specified using a set of expressions
For example, in Condor:
Requirements (true: access to the resource may start)
Rank (preference)
Suspend
Continue
Vacate (notification to stop using the resource)
Kill (immediately stop using the resource)
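These policy expressions live in the machine's configuration. The fragment below is a hedged sketch in condor_config style: the macro names (START, SUSPEND, ...) are HTCondor's, but the thresholds and helper expressions are illustrative:

```text
# Start jobs only after 15 idle minutes; suspend while the owner works.
START    = KeyboardIdle > 15 * $(MINUTE)
SUSPEND  = $(CpuBusy) || KeyboardIdle < $(MINUTE)
CONTINUE = KeyboardIdle > 2 * $(MINUTE)
VACATE   = TRUE
KILL     = $(ActivityTimer) > 10 * $(MINUTE)
```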
30
System Administration
Access Policies
Example from Condor:
31
System Administration
Reliability
The HTC system must be prepared for failures and must automate recovery from common failures
This is not an easy job:
Detect the difference between normal and abnormal termination
Don't leave running applications unattended
Choose the correct checkpoint to restart from
Decide when it is safe to restart the application
Determine & avoid bad nodes
System Administration
System logs
The primary tool for diagnosing system failures; they make it possible to reconstruct the events leading up to a failure
Problems and suggested solutions:
Log files can grow to unbounded size
Keep detailed logs for recent events and summaries for older information
Managing distributed log files
Store logs centrally on a file server or a customized log server
Provide a single interface by installing logging agents on each workstation
33
System Administration
Monitoring and Accounting
Helps the administrator to:
Assess the current and historical state of the system
Track the system usage
CondorView Usage Graph
34
System Administration
35
System Administration
Security (1)
Possible attacks
Resource attack
An unauthorized user gains access to a resource
An authorized user violates the resource owner's access policy
Customer attack
The customer's account or files are compromised via the HTC environment
36
System Administration
Security (2)
To protect against unauthorized resource access
The resource owner may specify authorized users in the access policy
Condor example:
Requirements = (Customer == "jbasney@cs.wisc.edu") || (Customer == "miron@cs.wisc.edu")
37
System Administration
Security (3)
To protect against violations of the resource access policy, the resource agent may:
Set resource consumption limits using the system API
Run the application under a "guest" account
Set the file system root directory to a "sandbox" directory
Intercept the system calls performed by the application via an OS interposition interface
38
System Administration
Security (4)
To protect the customer's account and files
HTC must ensure that all resource agents are trustworthy
Placing data files only on trusted hosts
Using an authentication mechanism
Encrypting network streams
39
System Administration
Remote Customers
Remote access is more convenient than direct access
The customer creates an HTC account
A customer agent can be installed on the customer's workstation
The administrator allows this agent to access the HTC cluster
For untrusted customers, extra security procedures may be required
40
Condor
41
Condor is suitable for high-throughput computations
Running many jobs at the same time on different machines
Exploiting idle machines
Allowing many jobs to be completed over a long period of time
Useful for researchers concerned with the number of jobs they can complete over a particular period of time
Condor
42
Running programs unattended and in the background
Redirecting console input & output from and to files
Notifying on completion via email
Allowing tracking of jobs' progress
Running one job on multiple machines
Surviving hardware and software failures
Allowing machines to join and leave
Enforcing your own policy
Condor
43
Condor can be seen as a distributed job scheduler
Scheduling submitted jobs on available machines
Allowing users to assign priorities to their jobs
Ensuring a fair resource share by constantly recalculating user priority
A lower numerical value means higher priority
Each user starts with the best priority (0.5)
Priority improves over time if the number of machines used < priority
Priority worsens over time if the number of machines used > priority
Using checkpoints for
Suspending and resuming jobs
Rescheduling jobs on different machines
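The fair-share mechanics above can be sketched as a smoothing update (an illustration, not Condor's exact formula; the one-day half-life matches Condor's default, the rest is simplified):

```python
HALF_LIFE = 86400   # seconds; priority "forgets" usage with a one-day half-life
FLOOR = 0.5         # the best (lowest) priority value a user can have

def update_priority(priority, machines_in_use, dt):
    """Drift the priority value toward current usage; lower is better."""
    beta = 0.5 ** (dt / HALF_LIFE)                 # decay factor for this interval
    new = beta * priority + (1 - beta) * machines_in_use
    return max(new, FLOOR)
```

A user holding 100 machines sees the value climb toward 100 (worse priority); once idle, it decays back toward the 0.5 floor, restoring their share.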
44
Condor as Distributed Job Scheduler
Distributed Job Scheduler
45
Machine expresses
Attributes
Conditions
Preferences
Job expresses
Attributes
Requirements
Preferences
Matchmaker
Finds matching
Notifies the matched parties
Distributed Job Scheduler
ClassAd Language
Describes jobs, workstations, and other resources
Same idea as the classified advertising section of a newspaper
Exchanged between processes to schedule jobs
Provides information about the state of the system
46
Distributed Job Scheduler
ClassAd Structure
A set of attribute-value pairs
Each value can be:
An integer
A floating point number
A string
A logical expression, evaluating to:
TRUE
FALSE
UNDEFINED
47
Distributed Job Scheduler
ClassAd Example:
48
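Since the original example slide is an image, here is a hedged reconstruction of what a machine ClassAd looks like (attribute names follow the ClassAd convention; the values are invented):

```text
MyType       = "Machine"
Name         = "node12.cs.colostate.edu"
OpSys        = "LINUX"
Arch         = "INTEL"
Memory       = 1024
KeyboardIdle = 1042
Requirements = KeyboardIdle > 15 * 60
Rank         = 0
```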
Also, attributes from different ClassAds can be used
For example: other.size > 3
Distributed Job Scheduler
Matchmaker:
Its job is to find matches between pairs of ClassAds (job & machine)
Two ClassAds (job & machine) match if the Requirements expressions in both ClassAds evaluate to true
If more than one match is found?
Rank is used
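A toy version of this matchmaking step, with Requirements and Rank modeled as Python functions (the attribute names and machines are invented for illustration; real ClassAd evaluation is richer and also handles UNDEFINED):

```python
def matches(job, machine):
    # Both sides' Requirements must evaluate to True against the other ad.
    return job["Requirements"](machine) and machine["Requirements"](job)

def best_match(job, machines):
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job["Rank"](m))  # Rank breaks ties

job = {
    "Owner": "walid",
    "Requirements": lambda m: m["OpSys"] == "LINUX" and m["Memory"] >= 512,
    "Rank": lambda m: m["Memory"],      # prefer machines with more memory
}
machines = [
    {"Name": "a", "OpSys": "LINUX", "Memory": 1024,
     "Requirements": lambda j: j["Owner"] != "abc"},
    {"Name": "b", "OpSys": "LINUX", "Memory": 2048,
     "Requirements": lambda j: True},
    {"Name": "c", "OpSys": "WINDOWS", "Memory": 4096,
     "Requirements": lambda j: True},
]
print(best_match(job, machines)["Name"])   # "b": both Linux machines match, b ranks higher
```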
49
Distributed Job Scheduler
50
51
Condor: How to submit the job
Distributed Job Scheduler
Job Submission
Can be done by submitting a job description file
Job description file
A plain ASCII text file used to describe a job or a cluster (several jobs)
Specifies how many times to run the job
Specifies the directories of the input and output files
Specifies how to receive notification when execution completes (email or log)
Selects a universe:
Standard or Vanilla
PVM
MPI
GLOBUS (Grid applications)
Scheduler (meta-schedulers)
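A small job description file might look like the following sketch (the submit-language keywords are HTCondor's; the program and file names are hypothetical):

```text
# Run three instances of the program "op" in the vanilla universe
universe     = vanilla
executable   = op
arguments    = $(Process)
input        = input.$(Process).dat
output       = output.$(Process).dat
error        = op.$(Process).err
log          = op.log
notification = Complete
queue 3
```

Submitted with `condor_submit`, this creates a cluster of three jobs, each with its own input and output files distinguished by $(Process).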
52
53
Description file Example:
Distributed Job Scheduler
54
Description file Example:
Distributed Job Scheduler
55
Description file Example:
Distributed Job Scheduler
Standard Universe
Runs serial jobs
Checkpointing is not supported at the kernel level
Instead, the source code is relinked with the Condor system call library
Transparently checkpoints & restarts jobs
Transparently migrates jobs
Automatically uses the remote file access mechanism
By default, checkpoints are stored on the local disk of the submit machine
Configurable: they can be stored on a checkpoint server
56
Distributed Job Scheduler
Standard Universe
Remote file access
57
Distributed Job Scheduler
Vanilla Universe
Runs almost all serial jobs
Runs any program that can run outside of Condor
Typically relies on a shared file system between the submit machine and the other nodes
If there is no shared file system, files will be transferred
58
Distributed Job Scheduler
MPI Universe
Manages parallel programs written using MPI
Uses only dedicated resources
59
Distributed Job Scheduler
PVM Universe
Gives the ability to submit PVM applications
PVM can ask Condor to add a new machine
60
61
Condor: dependencies between jobs
Distributed Job Scheduler
62
DAGMan Scheduler
Uses a directed acyclic graph (DAG) to specify dependencies
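A DAGMan input file lists the jobs and their ordering constraints. The sketch below uses DAGMan's JOB/PARENT syntax with hypothetical submit-file names, for a diamond-shaped dependency (A first, then B and C, then D):

```text
# diamond.dag
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
PARENT A   CHILD B C
PARENT B C CHILD D
```

Running `condor_submit_dag diamond.dag` lets DAGMan submit each job only after its parents complete.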
Distributed Job Scheduler
63
DAGMan Scheduler
Manages the submission of jobs
Condor Architecture
67
Condor Pool?
A pool comprises a central manager and a collection of jobs and machines
The central manager serves as a centralized repository of information about the state of the pool
Condor Architecture
68
Job Startup
Condor Architecture
69
INFN Condor pool
70
Grid overview
What is the Grid?
"Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources"
Virtual Organization (VO)
A dynamic set of individuals and institutions, defined by sharing rules, that share their resources to achieve a common goal
Example of a VO: a crisis management team together with the databases and simulation systems used to plan a response to an emergency situation
71
Grid overview
72
Grid overview
VO requirements
Flexible sharing relationships
Control on shared resources
Usage modes
Shared infrastructure services
Interoperability
Since Grid technology provides a general resource-sharing framework, it can be used to address the VO requirements
73
Grid Architecture
The Grid architecture is formed as layers in an hourglass shape
Each layer contains components sharing the same role
Components in each layer can use the services of lower layers
Interaction between components is done through standard protocols
74
Grid Architecture
Fabric
Interface to local control
Implements the local, resource-specific operations
Implements resource enquiry and resource management mechanisms:
Computational: monitoring and controlling process execution
Storage: reading and writing files
Network: control over network resources
Code repository: managing versioned source code
75
Grid Architecture
76
Connectivity
Defines core communication and authentication protocols
Used to exchange data between Fabric layer resources
Authentication solutions:
Single sign-on: log on once and have access to multiple Grid resources
Delegation
User-based trust relationships
Grid Architecture
77
Resource
Sharing single resources
Defines protocols for secure negotiation, initiation, monitoring, control, accounting, and payment of sharing operations on individual resources
Two primary classes of Resource layer protocols:
Information protocols
Provide information about the structure and state of a resource
Management protocols
Negotiate access to a shared resource
Grid Architecture
78
Collective
Coordinating multiple resources
Defines protocols that capture interactions across collections of resources
Service examples:
Directory services
Co-allocation, scheduling, and brokering services
Monitoring and diagnostics services
Data replication services
Grid-enabled programming systems
Workload management systems and collaboration frameworks
Software discovery services
Community authorization servers
Grid Architecture
79
Application
Implements the business logic
Operates within the VO environment
Constructed by calling services defined at any layer
80
81
Globus Toolkit
Globus
A community of organizations and individuals developing fundamental technologies behind the Grid
Globus Toolkit
An open-source software toolkit providing basic infrastructure, protocols, and services to build grids and grid applications
82
Globus Toolkit
Who is involved in Globus Alliance?
Argonne National Laboratory’s Mathematics and Computer Science Division
The University of Southern California’s Information Sciences Institute
The University of Chicago's Distributed Systems Laboratory
The University of Edinburgh in Scotland
The Swedish Center for Parallel Computers
National Computational Science Alliance
The NASA Information Power Grid project
…..
83
Globus Toolkit
Projects using Globus Toolkit
Computer Science
Condor
DOE e-Services
GridLab
GriPhyN
NMI GridShib
NMI Performance Monitoring
OGCE
OGSA-DAI
SciDAC CoG
SciDAC Data Grid
SciDAC Security
vGRADS
Physics
FusionGrid
LIGO
Particle Physics Data Grid
Infrastructure
ASCI (HPSS)
EGEE
Grid3
GRIDS Center
iVDGL
NorduGrid
Open Science Grid
TeraGrid
UK e-Science
Astronomy
Sloan Digital Sky Survey
National Virtual
Observatory
Chemistry
CMCS
Civil Engineering
NEES
Climate Studies
LEAD
Earth System Grid
Collaboration
Access Grid
84
Globus Toolkit
The Toolkit
Includes a set of services and software components to support building Grids and their applications
Includes a set of modules
Each module provides an interface used by higher-level services to invoke the module's mechanisms
Each module provides implementations that use low-level operations, making it possible to implement these mechanisms in different environments
85
Globus Toolkit
86
Globus Toolkit
Fabric:
Any resources that can be shared
For example: a distributed file system, or Condor
Resources are defined by vendor-supplied interfaces
Includes enquiry software to detect resource capabilities and deliver this information to higher-level services
87
Globus Toolkit
Connectivity
Grid Security Infrastructure (GSI)
Nexus
88
Globus Toolkit
Resource
Grid Resource Allocation and Management (GRAM) protocol
Grid Resource Information Protocol (GRIP)
Grid Resource Registration Protocol (GRRP)
GridFTP
89
Globus Toolkit
Collective
Grid Information Index Servers (GIISs)
LDAP information protocol
Dynamically-Updated Request Online Coallocator (DUROC)
90
Commonalities & Contrast
Commonalities
Using dedicated & non-dedicated resources
Providing powerful capacity
Contrast
Globus provides tools to build grids, while Condor is software that exploits workstation resources to perform extensive tasks
Condor and Globus are complementary technologies
Condor-G is a Globus-enabled version of Condor
91
Inefficiencies & Possible Problems in Condor:
One central manager
One checkpoint server
Possible solution:
For each one, have a mirror server that can take over if the original server crashes
92
Recommended