
Programming the CoW!

Tools to start with on the new cluster.

What’s it good for?

Net DOOM?

It should be good for computation, and to a lesser extent visualization.

It’s a shame about Ray.

             RAY                             CoW
Cost:        $1.5 million                    $150 thousand
CPUs:        32 R12K 400MHz, 8MB cache       64 Xeon 1.7GHz, 256KB cache
SPEC(base):  INT=328 FP=382                  INT=579 FP=656
RAM:         16GB total                      32GB total
Graphics:    InfiniteReality2E GFX pipes     32 GeForce3 GFX cards
OS:          IRIX64 6.5                      Linux 2.4.18

So the CoW is ~4x faster, right?

Fortunately for SGI, no!

               RAY                                    CoW
Interconnect:  the system bus                         1G and 100M Ethernet cards
Latency:       335ns avg, processor to remote memory  ~59us onto the net
Bandwidth:     10GB/sec sustained                     0.5GB/sec sustained

So it should be great for coarse-grained computations.

That is, design your programs to have long processing cycles and infrequent inter-node communication needs, and you should be just fine.

How do we program it?

Shared Memory – A global memory space is available to all nodes. Nodes use synchronization primitives to avoid contention.

Message Passing – Every node has only private memory space. All communication between nodes has to be explicitly directed.

[Diagram: shared memory – five NODEs all attached to one global MEMORY]

[Diagram: message passing – five NODEs, each with its own private MEM]

L x R = RES

Thread Matrix Multiply

Workers split L, and each multiplies with all of R to get a part of RES.

Thread Matrix Multiply Example

On the cluster we have no hardware support for SM, so MP is the natural alternative.

Unix supports sockets for MP.

People have built higher-level MP libraries out of sockets that make life easier.

Two that I am familiar with are PVM and MPI.

PVM: Parallel Virtual Machine.

Started in 1989.

http://www.csm.ornl.gov/pvm

A PVM is a virtual machine made of a collection of independent nodes.

It has a lot of support for heterogeneous clusters.

It’s easy to use, and maybe lower performing than MPI.

PVM

Each node runs one pvmd daemon.

Each node can run one or more tasks.

Tasks use the pvmd to communicate with other tasks.

Tasks can start new tasks, stop tasks, or delete nodes from the PVM at will.

Tasks can be grouped.

PVM comes with a console program that lets you control the PVM easily.

PVM: Setup

#Where PVM is installed.
setenv PVM_ROOT /home/demarle/sci/distrib/mps/pvm3

#What type of machine this node is.
setenv PVM_ARCH LINUX

#Where the ssh command is.
setenv PVM_RSH /local/bin/ssh

#Where your PVM applications are.
setenv PVMBIN $PVM_ROOT/bin/LINUX

#Where the pvm executables are.
setenv PATH ${PATH}:$PVM_ROOT/lib
setenv PATH ${PATH}:$PVM_ROOT/bin/LINUX

PVM CONSOLE:

[demarle@labnix13 scisem]$ pvm
pvm> add labnix14
add labnix14
1 successful
                    HOST     DTID
                labnix14    80000
pvm> conf
conf
2 hosts, 1 data format
                    HOST     DTID     ARCH   SPEED       DSIG
                labnix14    40000    LINUX    1000 0x00408841
                labnix13    80000    LINUX    1000 0x00408841
pvm> quit
quit
Console: exit handler called
pvmd still running.
[demarle@labnix13 scisem]$

PVM CONSOLE, continued:

[demarle@labnix13 scisem]$ cord_racer
Suspended
[demarle@labnix13 scisem]$ pvm
pvmd already running.
pvm> ps
ps
                    HOST      TID  FLAG 0x  COMMAND
                labnix13    40016      4/c        -
                labnix13    40017    6/c,f    adsmd

Use "pvm> help" to get a list of commands.
Use "pvm> kill" to kill tasks.
Use "pvm> delete" to delete nodes from the PVM.
Use "pvm> halt" to stop every PVM task and daemon.

PVM: IMPORTANT LIBRARY CALLS

pvm_spawn() – Task starts children. The PvmTaskDebug argument starts them under gdb.

pvm_catchout() – Task outputs the terminal output of all children.

pvm_mytid() – What is my task id?

pvm_parent() – What is my parent's task id?

PVM: IMPORTANT LIBRARY CALLS

pvm_initsend() – Clear the default buffer and prepare it to send.

pvm_packf() – Put data into a send buffer.

pvm_send() – Transmit a buffer.

pvm_recv() – Receive data into a buffer.

pvm_nrecv() – Non-blocking receive.

pvm_unpackf() – Move data from a buffer into variables.

PVM: IMPORTANT LIBRARY CALLS

pvm_joingroup() – Add this task to a group.

pvm_lvgroup() – Remove this task from a group.

pvm_bcast() – Broadcast a buffer to all members of a group.

pvm_barrier() – Wait here until all other tasks in the group are also here.

pvm_reduce() – Perform a global operation. Ex. max: each task gives a number, and one task reduces the values to the maximum.

PVM: IMPORTANT LIBRARY CALLS

pvm_config() – What machines are in the PVM?

pvm_addhosts() – Add a machine to the PVM.

pvm_delhosts() – Remove a machine from the PVM.

pvm_tasks() – What tasks are running on the PVM?

pvm_exit() – Remove this task from the PVM.

L x R = RES

Message Passing Matrix Multiply

Workers split L and R.

They always multiply their L’, and take turns broadcasting their R’.

PVM Matrix Multiply Example

MPI: Message Passing Interface.

Started in 1992.

http://www-unix.mcs.anl.gov/mpi/index.html

Goal - to standardize message passing so that parallel code can be portable.

Unlike PVM, it does not specify the virtual machine environment. For instance, it does not say how to start a program.

It has more basic operations than PVM.

It's supposed to be lower level and faster.

MPICH

A free implementation of the MPI standard.
http://www-unix.mcs.anl.gov/mpi/mpich

+ it comes with some extras, like scripts that give you some of PVM’s niceties.

mpirun - a script to start your programs with.
mpicc, mpiCC, mpif77, and mpif90 - compiler wrappers.
MPE – a set of performance analysis and program visualization tools.

MPI: Setup

#Where MPI is installed.
setenv MYMPI /home/demarle/sci/distrib/mps/mpi/mpich-1.2.3

#Where the ssh command is.
setenv RSHCOMMAND /local/bin/ssh

#Where the executables are.
setenv PATH ${PATH}:${MYMPI}/bin

Uses a file to specify which machines you can use:
${MYMPI}/util/machines/machines.LINUX

To start an executable:
mpirun <-dbg=gdb> -np # filename

MPI: IMPORTANT LIBRARY CALLS

MPI_Init() – Begin the MPI session for this task.

MPI_Finalize() – Leave MPI.

MPI_Comm_create() – Create a communicator, something like a group of groups.

MPI_Comm_size() – How many tasks are in the communicator?

MPI_Comm_rank() – Which task am I?

MPI_Comm_group() – Access a specific group in the communicator.

MPI_Graph_get() – Query the topology of a communicator.

MPI: IMPORTANT LIBRARY CALLS

MPI_Group_size() – What is the size of a group?

MPI_Group_rank() – What is this task’s place in the group?

MPI_Barrier() – Wait for all tasks in the group to catch up.

MPI_Bcast() – Broadcast a message to all others in the group.

MPI_Reduce() – Perform an operation across a group’s values.

MPI_File_*() – Group-wide file operations.

MPI: IMPORTANT LIBRARY CALLS

MPI_Pack() – Put data into a buffer for a later send.

MPI_Send() – Send a buffer.

MPI_Isend() – Non-blocking send.

MPI_Probe() – Test for an incoming buffer.

MPI_Iprobe() – Non-blocking test.

MPI_Recv() – Receive a buffer.

MPI_Irecv() – Non-blocking receive.

MPI_Unpack() – Get data from a received buffer.

If you don't want the overhead of the PVM and MPI libraries and daemons, you can do essentially the same thing with sockets.

Sockets will be faster, but also harder to use. They don’t come with groups, barriers, reductions, etc. You have to create these yourself.

SOCKETS

Think of file descriptors: sock = socket() ~ fd = fopen().

int sock = socket(Domain, Type, Protocol);

Domain:
AF_INET – over the net.
AF_UNIX – local to a node.

Type:
SOCK_STREAM – two-ended connections, reliable, no size limit (i.e. TCP).
SOCK_DGRAM – connectionless, unreliable, ~1500 bytes (i.e. UDP).

Protocol – like a flavor of the domain; these two just take 0.

Basic Process for a Master Task

//open a socket, like a file descriptor
sock = socket(AF_INET, SOCK_STREAM, 0);

//bind your end to this machine's IP address and this program's PORT
int ret = bind(sock, (struct sockaddr *) &servAddr, sizeof(servAddr));

//let the socket listen for connections from remote machines
ret = listen(sock, BACKLOG);

//start remote programs
system("ssh labnix14 worker.exe");

TO BE CONTINUED …

Basic Process for a Worker

//put yourself in background and nohup, to let the master continue
ret = daemon(1, 0);

//open a socket
int sock = socket(AF_INET, SOCK_STREAM, 0);

//bind your end to this machine's IP address and this program’s PORT
ret = bind(sock, (struct sockaddr *) &cliAddr, sizeof(cliAddr));

//connect this socket to the listening one in the master
ret = connect(sock, (struct sockaddr *) &servAddr, sizeof(servAddr));

TO BE CONTINUED …

Basic Process for a Master Task, cont.

//accept each worker’s connection to finish a new two-ended socket.
children[c].sock = accept(sock,
                          (struct sockaddr *) &children[c].cliAddr,
                          &children[c].cliAddrLen);

//send and receive over the socket as you like
ret = send(children[c].sock, parms, 8*sizeof(double), 0);
ret = recv(children[c].sock, RES+rr*rsc, rpr*rpc, MSG_WAITALL);

//close the sockets when you are done with them
close(children[c].sock);

Basic Process for a Worker, cont.

//send and receive data as you please
ret = recv(sock, parms, 7*sizeof(int), 0);
ret = send(sock, (void *)RET, len2, 0);

//close the socket when you are done with it
close(sock);

Shared Memory on cluster?

SM code was so much simpler.

So a lot of people have built DSM systems.

Adsmith, CRL, CVM, DIPC, DSM-PM2, PVMSYNC, Quarks, SENSE, TreadMarks, to name a few…

Two types of Software DSMs

PAGE Based DSMs

Use the Virtual Memory Manager: install a signal handler to catch segfaults.

Use mprotect to protect virtual memory pages assigned to remote nodes.

On a segfault, the process blocks; the segfault handler gets the page from a remote node, then returns to the process.

They suffer when two or more nodes want to write to different and unrelated places on the same memory page (false sharing).

Object Based DSMs

Let the programmer define the unit of sharing, and then provide each shared object with something like load, modify and save methods.

They can eliminate false sharing, but they often aren’t as easy to use.

DIPC

Distributed Inter Process Communication.

Page based. It’s an extension to the Linux kernel; specifically it extends SYSTEM V IPC.

SYSTEM V IPC? Like an alternative to threads, it lets arbitrary unrelated processes work together.

Threads share the program's entire global space. For shmem, processes explicitly declare what is shared.

SYSTEM V IPC also means messages and semaphores.

Basic idea

//create an object to share
volatile struct shared { int i; } *shared;

//make the object shareable
shmid = shmget(IPC_PRIVATE,
               sizeof(struct shared),
               (IPC_CREAT | 0600));
shared = ((volatile struct shared *) shmat(shmid, 0, 0));
shmctl(shmid, IPC_RMID, 0);

//start children; now they don't have copies of “shared”, they all actually access the original one.
fork();

//all children can access the shared object whenever they want
shared->i = 0;

How would this change for DIPC?

#define IPC_DIPC 00010000

shmid = shmget(IPC_PRIVATE,
               sizeof(struct shared),
               (IPC_CREAT | IPC_DIPC | 0600));

//The same thing applies for semget and msgget.

DIPC works by adding a small modification to the Linux kernel.

The kernel looks for IPC_DIPC structures and bumps them out to a user-level daemon. Structures without the flag are treated normally.

The daemon satisfies the request over the network and then returns the data to the kernel, which in turn returns the data to the user process.

The great thing about DIPC is that it is very compatible with normal Linux.

A DIPC program will run just fine on an isolated machine without DIPC; the flag will just be ignored.

This means you can develop your software off the cluster and then just throw it on to make use of all the CPUs.

DIPC Problems?

It enforces strict sequential consistency, which is very easy to use but wastes a lot of network traffic.

The version for the 2.4.X kernel isn't finished yet.

Summary

CPU: strong. COMMUNICATIONS: weak.

MP: PVM, MPI, SOCKETS

DSM: DIPC?, Quarks?, …

REFERENCES

PVM
http://www.csm.ornl.gov/pvm

MPI
http://www-unix.mcs.anl.gov/mpi/index.html

MPICH
http://www-unix.mcs.anl.gov/mpi/mpich

DIPC
http://wallybox.cei.net/dipc