
Advanced Charm++ Tutorial


  • Advanced Charm++ Tutorial (Charm Workshop Tutorial)

    Gengbin Zheng, charm.cs.uiuc.edu, 10/19/2005

  • Building Charm++; Advanced messaging; Interface file (.ci); Groups; Delegation; Array multicast; Advanced load balancing; Threads; SDAG

  • Charm++ on Parallel Machines
    Runs on: any machine with MPI, including IBM SP, Blue Gene/L, Cray XT3, Origin2000, and PSC's Lemieux (Quadrics Elan); clusters with Ethernet (UDP/TCP); clusters with Myrinet (GM); clusters with Ammasso cards; Apple clusters; even Windows! SMP-aware (pthreads).

  • Communication Architecture
    Layered design: the Converse communication API sits on top of the machine layers Net, MPI, Elan, and BG/L. Under Net: UDP (machine-eth.c), TCP (machine-tcp.c), Myrinet (machine-gm.c), Ammasso (machine-ammasso.c).

  • Compiling Charm++
    ./build
    Usage: build <target> <version> <options> [charmc-options ...]

    <targets>: converse charm++ LIBS AMPI FEM bluegene pose jade msa
    <doc targets>: doc ps-doc pdf-doc html-doc
    <versions>:
      bluegenel        mpi-sol          net-sol-amd64
      elan-axp         mpi-sp           net-sol-x86
      elan-linux-ia64  ncube2           net-sun
      exemplar         net-axp          net-win32
      mpi-axp          net-cygwin       origin2000
      mpi-bluegenel    net-hp           origin-pthreads
      mpi-crayx1       net-hp-ia64      paragon-red
      mpi-crayxt3      net-irix         shmem-axp
      mpi-exemplar     net-linux        sim-linux
      mpi-hp-ia64      net-linux-amd64  sp3
      mpi-linux        net-linux-axp    t3e
      mpi-linux-amd64  net-linux-ia64   uth-linux
      mpi-linux-axp    net-ppc-darwin   uth-win32
      mpi-linux-ia64   net-rs6k         vmi-linux
      mpi-origin       net-sol          vmi-linux-ia64
      mpi-ppc-darwin

    <options>: compiler and platform specific options
      cc cc64 cxx kcc pgcc acc icc ecc gcc3 mpcc pathscale
      help smp gm tcp vmi scyld clustermatic bluegene ooc syncft papi
      --incdir --libdir --basedir --no-build-shared -j

    <charmc-options>: normal compiler options, e.g. -g -O -save -verbose

    To get more detailed help, run ./build --help

  • Build Options
    <options>: compiler and platform specific options
      cc cc64 cxx kcc pgcc acc icc ecc gcc3 mpcc pathscale
      help smp gm tcp vmi scyld clustermatic bluegene ooc syncft papi
      --incdir --libdir --basedir --no-build-shared -j

    For platform-specific options, use the help option, e.g. ./build charm++ net-linux help

    Choose a compiler (only one option is allowed from this section):
      cc, cc64    Sun WorkShop C++ 32/64-bit compilers
      cxx         DIGITAL C++ compiler (DEC Alpha)
      kcc         KAI C++ compiler
      pgcc        Portland Group's C++ compiler
      acc         HP aCC compiler
      icc         Intel C/C++ compiler for Linux IA32
      ecc         Intel C/C++ compiler for Linux IA64
      gcc3        use gcc3 (GNU GCC/G++ version 3)
      mpcc        Sun Solaris C++ compiler for MPI
      pathscale   use the PathScale compiler suite

    Platform-specific options (choose multiple if they apply):
      smp           support for SMP, multithreaded Charm on each node
      mpt           use SGI Message Passing Toolkit (only for mpi version)
      gm            use Myrinet for communication
      tcp           use TCP sockets for communication (only for net version)
      vmi           use NCSA's VMI for communication (only for mpi version)
      scyld         compile for Scyld Beowulf cluster based on bproc
      clustermatic  compile for Clustermatic (supports versions 3 and 4)

    Advanced options:
      bluegene  compile for the BigSim (Blue Gene) simulator
      ooc       compile with out-of-core support
      syncft    compile with Charm++ fault-tolerance support
      papi      compile with PAPI performance counter support (if any)

    Charm++ dynamic libraries:
      --build-shared     build Charm++ dynamic libraries (.so) (default)
      --no-build-shared  don't build Charm++'s shared libraries

    Miscellaneous options:
      --incdir=DIR   specify additional include path for the compiler
      --libdir=DIR   specify additional lib path for the compiler
      --basedir=DIR  shortcut for the above two: DIR/include and DIR/lib
      -j[N]          parallel make; N is the number of parallel make jobs
      --with-romio   build AMPI with the ROMIO library

  • Build Script
    The build script does:
      ./build <target> <version> <options> [charmc-options ...]
    Creates the directories <version> and <version>/tmp
    Copies src/scripts/Makefile into <version>/tmp
    Does a "make <target> <version> OPTS=<charmc-options>" in <version>/tmp
    That's all build does. The rest is handled by the Makefile.

  • How Build Works
    build charm++ net-linux gm smp kcc bluegene
    Sorts gm, smp, and bluegene
    Mkdir net-linux-bluegene-gm-smp-kcc
    Cats conv-mach-[kcc|bluegene|gm|smp].h into conv-mach-opt.h
    Cats conv-mach-[kcc|bluegene|gm|smp].sh into conv-mach-opt.sh
    Gathers files from net, etc. (Makefile)
    Makes charm++ under net-linux-bluegene-gm-smp-kcc/tmp

  • How Charmrun Works
    charmrun +p4 ./pgm

  • Charmrun (batch mode)
    charmrun +p4 ++batch 8

  • Debugging Charm++ Applications
    Printf
    Gdb: sequentially (standalone mode), gdb ./pgm +vp16; or attach gdb manually
    Run the debugger in an xterm: charmrun +p4 pgm ++debug, or charmrun +p4 pgm ++debug-no-pause
    Memory paranoia: -memory paranoid
    Parallel debugger

  • How to Become a Charm++ Hacker
    Advanced Charm++: advanced messaging; interface files (.ci); writing system libraries; groups; delegation; array multicast; threads; SDAG

  • Advanced Messaging

  • Prioritized Execution
    The Charm++ scheduler defaults to FIFO (oldest message first). With prioritized execution, if several messages are available, Charm++ processes them in the order of their priorities. Very useful for speculative work, ordering timestamps, etc.

  • Priority Classes
    The Charm++ scheduler has three queues: high, default, and low.
    As signed integer priorities: -MAXINT (highest) ... 0 (default) ... +MAXINT (lowest).
    As unsigned bitvector priorities: 0x0000 (highest) ... 0x8000 (default) ... 0xFFFF (lowest).

  • Prioritized Messages
    The number of priority bits is passed during message allocation:
      FooMsg *msg = new (size, nbits) FooMsg;
    Priorities are stored at the end of the message.

    Signed integer priorities:
      *(int*)CkPriorityPtr(msg) = -1;
      CkSetQueueing(msg, CK_QUEUEING_IFIFO);
    Unsigned bitvector priorities:
      ((unsigned int*)CkPriorityPtr(msg))[0] = 0x7fffffff;
      CkSetQueueing(msg, CK_QUEUEING_BFIFO);
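    Putting the pieces together, a minimal sketch of allocating, prioritizing, and sending such a message (FooMsg, size, and fooProxy are illustrative names, not from the Charm++ API):

      // allocate with room for one signed-integer priority (nbits = 8*sizeof(int))
      FooMsg *msg = new (size, 8*sizeof(int)) FooMsg;
      *(int*)CkPriorityPtr(msg) = -5;         // more negative = higher priority
      CkSetQueueing(msg, CK_QUEUEING_IFIFO);  // interpret priority as a signed int
      fooProxy.bar(msg);                      // scheduler delivers in priority order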

  • Prioritized Marshalled Messages
    Pass a CkEntryOptions object as the last parameter.

    For signed integer priorities:
      CkEntryOptions opts;
      opts.setPriority(-1);
      fooProxy.bar(x, y, opts);

    For bitvector priorities:
      CkEntryOptions opts;
      unsigned int prio[2] = {0x7FFFFFFF, 0xFFFFFFFF};
      opts.setPriority(64, prio);
      fooProxy.bar(x, y, opts);

  • Advanced Message Features
    Read-only messages: the entry method agrees not to modify or delete the message; avoids a message copy for broadcasts, saving time.
    Inline messages: direct invocation if on the local processor.
    Expedited messages: messages do not go through the Charm++ scheduler (faster).
    Immediate messages: entries are executed in an interrupt or on the communication thread; very fast, but tough to get right. Immediate messages currently work only for NodeGroups and Groups (non-SMP).

  • Read-Only, Expedited, Immediate
    All are declared in the .ci file:
      entry [nokeep]    void foo_readonly(Msg *);
      entry [inline]    void foo_inl(Msg *);
      entry [expedited] void foo_exp(Msg *);
      entry [immediate] void foo_imm(Msg *);

  • Interface File (ci)

  • Interface File Example
      mainmodule hello {
        include "myType.h";
        initnode void myNodeInit();
        initproc void myInit();

        mainchare mymain {
          entry mymain(CkArgMsg *m);
        };

        array [1D] foo {
          entry foo(int problemNo);
          entry void bar1(int x);
          entry void bar2(myType x);
        };
      };

  • Include and Initcall
    Include: includes an external header file.
    Initcall: user plug-in code to be invoked in Charm++'s startup phase.
      Initnode: called once on every node.
      Initproc: called once on every processor.
      Initnode calls are made before initproc calls.

  • Entry Attributes
    threaded: the function is invoked in a CthThread.
    sync: blocking methods that can return values as a message; the caller must be a thread.
    exclusive: for node groups; does not execute while other exclusive entry methods of its node group are executing on the same node.
    notrace: invisible to trace projections, e.g.
      entry [notrace] void recvMsg(multicastGrpMsg *m);
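    As a minimal sketch of how threaded and sync combine (the Worker chare, workerProxy, and ResultMsg are illustrative names, not from the source):

      // .ci:  entry [sync] ResultMsg *compute(int n);
      //       entry [threaded] void driver();
      void Worker::driver() {   // runs in its own CthThread
        // the [sync] call blocks this thread (not the processor) until a reply arrives
        ResultMsg *r = workerProxy[1].compute(42);
        CkPrintf("result: %d\n", r->value);
        delete r;
      }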

  • Groups/Node Groups

  • Groups and Node Groups
    Groups: a collection of objects (chares), also called branch-office chares (BOC). Exactly one representative on each processor; ideally suited for system libraries.
    Similar to arrays: broadcasts, reductions, indexing. But not completely like arrays: non-migratable; one per processor.
    Node groups: one per node (SMP).

  • Declarations
    .ci file:
      group mygroup {
        entry mygroup();          // constructor
        entry void foo(foomsg *); // entry method
      };
      nodegroup mynodegroup {
        entry mynodegroup();      // constructor
        entry void foo(foomsg *); // entry method
      };
    C++ file:
      class mygroup : public Group {
        mygroup() {}
        void foo(foomsg *m) { CkPrintf("Do Nothing"); }
      };
      class mynodegroup : public NodeGroup {
        mynodegroup() {}
        void foo(foomsg *m) { CkPrintf("Do Nothing"); }
      };

  • Creating and Calling Groups
    Creation:
      p = CProxy_mygroup::ckNew();
    Remote invocation:
      p.foo(msg);            // broadcast
      p[1].foo(msg);         // asynchronous
      p.foo(msg, npes, pes); // list send
    Direct local access:
      mygroup *g = p.ckLocalBranch();
      g->foo(...);           // local invocation
    Danger: if you migrate, the group stays behind!

  • Delegation

  • Delegation
    A customized implementation of messaging: it enables Charm++ proxy messages to be forwarded to a delegation manager group.
    The delegation manager traps calls to proxy sends and applies optimizations. It must inherit from the CkDelegateMgr class.
    The user program must call proxy.ckDelegate(mgrID);

  • Delegation Interface
    .ci file:
      group MyDelegateMgr {
        entry MyDelegateMgr(); // constructor
      };
    .h file:
      class MyDelegateMgr : public CkDelegateMgr {
        MyDelegateMgr();
        void ArraySend(..., int ep, void *m, const CkArrayIndexMax &idx, CkArrayID a);
        void ArrayBroadcast(...);
        void ArraySectionSend(..., CkSectionID &s);
        ...
      };
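    Hooking the manager up might look like the following sketch (arrayProxy and someEntry are illustrative; the slides only show the ckDelegate call itself):

      // create the delegation manager group and route a proxy's sends through it
      CkGroupID mgrID = CProxy_MyDelegateMgr::ckNew();
      arrayProxy.ckDelegate(mgrID);  // subsequent sends are trapped by MyDelegateMgr
      arrayProxy.someEntry(msg);     // dispatched via ArraySend/ArrayBroadcast above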

  • Array Multicast

  • Array Multicast/Reduction Library
    An array section is a subset of a chare array.
    Array section creation: enumerate array indices, e.g. a section over the even-numbered elements:
      CkVec<CkArrayIndex1D> elems;  // add array indices
      for (int i = 0; i < 10; i += 2)
        elems.push_back(CkArrayIndex1D(i));
      CProxySection_Hello proxy =
          CProxySection_Hello::ckNew(helloArrayID, elems.getVec(), elems.size());
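    The same section can also be built with the range form of ckNew; a minimal sketch (helloArrayID as above):

      // section over elements 0, 2, 4, 6, 8: lower bound, upper bound, stride
      CProxySection_Hello proxy = CProxySection_Hello::ckNew(helloArrayID, 0, 8, 2);
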
  • Array Section Multicast
    Once you have the array-section proxy, multicast to all the section members:
      CProxySection_Hello proxy;
      proxy.foo(msg);  // multicast
    Send messages to one member using its local index:
      proxy[0].foo(msg);

  • Array Section Multicast
    Multicast via delegation: the CkMulticast communication library.
      CProxySection_Hello sectProxy = CProxySection_Hello::ckNew();
      CkGroupID mCastGrpId = CProxy_CkMulticastMgr::ckNew();
      CkMulticastMgr *mcastGrp = CProxy_CkMulticastMgr(mCastGrpId).ckLocalBranch();
      sectProxy.ckSectionDelegate(mCastGrpId); // initialize the proxy
      sectProxy.foo(...);                      // multicast via delegation
    Note: to use the CkMulticast library, all multicast messages must inherit from CkMcastBaseMsg, as follows:
      class HiMsg : public CkMcastBaseMsg, public CMessage_HiMsg {
      public:
        int *data;
      };

  • Array Section Reduction
    Section reduction with delegation, using the default reduction callback:
      CProxySection_Hello sectProxy;
      CkMulticastMgr *mcastGrp = CProxy_CkMulticastMgr(mCastGrpId).ckLocalBranch();
      mcastGrp->setReductionClient(sectProxy, new CkCallback(...));
    Contributing to the reduction:
      CkGetSectionInfo(sid, msg);
      CkCallback cb(CkIndex_myArray::foo(NULL), thisProxy);
      mcastGrp->contribute(sizeof(int), &data, CkReduction::sum_int, sid, cb);

  • With Migration
    Works with migration. When intermediate nodes migrate, the multicast tree is rebuilt automatically. When the root processor migrates, the application needs to initiate the rebuild; this will become automatic in the future.

  • Advanced Load-balancers

    Writing a Load-balancing Strategy

  • Advanced Load Balancing: Writing a New Strategy
    Inherit from CentralLB and implement the work() function:
      class fooLB : public CentralLB {
      public:
        ...
        void work(CentralLB::LDStats *stats, int count);
        ...
      };

  • LB Database
      struct LDStats {
        ProcStats  *procs;
        LDObjData  *objData;
        LDCommData *commData;
        int        *to_proc;
        // ...
      };

    A dummy work function which assigns all objects to processor 0 (don't implement it like this!):
      void fooLB::work(CentralLB::LDStats *stats, int count) {
        for (int obj = 0; obj < stats->n_objs; obj++)
          stats->to_proc[obj] = 0;
      }
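    By contrast, a strategy that does something marginally useful might scatter objects round-robin; a minimal sketch, assuming the LDStats fields shown above plus a per-object migratable flag:

      // assign migratable objects round-robin across the count processors
      void fooLB::work(CentralLB::LDStats *stats, int count) {
        int next = 0;
        for (int obj = 0; obj < stats->n_objs; obj++) {
          if (!stats->objData[obj].migratable)
            continue;                  // leave pinned objects where they are
          stats->to_proc[obj] = next;  // new home processor for this object
          next = (next + 1) % count;   // count = number of processors
        }
      }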

  • Compiling and Integration
    Edit and run Makefile_lb.sh. It creates Make.lb, which is included by the main Makefile.
    Run make depends to correct dependencies.
    Rebuild Charm++; the new balancer is then available via +balancer fooLB.

  • Threads in Charm++

  • Why Use Threads?
    They provide one key feature: blocking. Suspend execution (e.g., at a message receive), do something else, and resume later (e.g., after the message arrives). Example: MPI_Recv, MPI_Wait semantics.
    A function-call interface is more convenient than message passing: a regular call/return structure (no CkCallbacks) with complete control flow, which allows blocking in the middle of a deeply nested communication subroutine.

  • Why Not Use Threads?
    Slower: around 1 us of context-switching overhead is unavoidable; creation/deletion costs perhaps 10 us.
    Migration is more difficult: the state of a thread is scattered through its stack, which is maintained by the compiler; by contrast, the state of an object is maintained by the user.
    These thread disadvantages are the motivation for SDAG (later).

  • Context Switch Cost
    [Chart: "Context Switching Time (Turing Apple Cluster)": context switch time (us) vs. number of flows, comparing processes, CthThreads, and pthreads. CthThreads stay at roughly 0.8-2 us from 1 up to 15000 flows, while processes and pthreads climb past 10 us by a few hundred flows.]

  • What Are (Converse) Threads?
    One flow of control (instruction stream): machine registers, a program counter, and an execution stack.
    Like pthreads (kernel threads), except that they are implemented at user level (in Converse), scheduled at user level (non-preemptive), and migratable between nodes.

  • How Do I Use Threads?
    Many options:
      AMPI: always uses threads via the TCharm library
      Charm++: [threaded] entry methods run in a thread; [sync] methods
      Converse: C routines CthCreate/CthSuspend/CthAwaken; everything else is built on these
    Implemented using SYSV makecontext/setcontext, POSIX setjmp/alloca/longjmp, or assembly code.

  • How Do I Use Threads (Example)
    Blocking API routine: find an array element
      int requestFoo(int src) {
        myObject *obj = ...;
        return obj->fooRequest(src);
      }
    Send the request and suspend
      int myObject::fooRequest(int src) {
        proxy[dest].fooNetworkRequest(thisIndex);
        stashed_thread = CthSelf();
        CthSuspend();  // -- blocks until awaken call --
        return stashed_return;
      }
    Awaken the thread when data arrives
      void myObject::fooNetworkResponse(int ret) {
        stashed_return = ret;
        CthAwaken(stashed_thread);
      }

  • How Do I Use Threads (Example)
    Send request, suspend, receive, awaken, return:
      int myObject::fooRequest(int src) {
        proxy[dest].fooNetworkRequest(thisIndex);
        stashed_thread = CthSelf();
        CthSuspend();
        return stashed_return;
      }
      void myObject::fooNetworkResponse(int ret) {
        stashed_return = ret;
        CthAwaken(stashed_thread);
      }

  • Thread Migration

  • Stack Data
    The stack is used by the compiler to track function calls and provide temporary storage: local variables, subroutine parameters, C alloca storage.
    Most of the variables in a typical application are stack data.
    The stack is allocated by the Charm++ runtime as heap memory (the +stacksize option).

  • Migrating Stack Data
    Without compiler support, we cannot change a stack's address, because we can't update the stack's interior pointers (return frame pointers, function arguments, etc.): existing pointers to addresses in the original stack would become invalid.
    Solution: isomalloc addresses. Reserve address space on every processor for every thread stack, and use mmap to scatter stacks in virtual memory efficiently. The idea comes from PM2.

  • Migrating Stack Data
    [Diagram: Processor A's memory and Processor B's memory, each spanning 0x00000000-0xFFFFFFFF with code, globals, heap, and thread stacks. Thread 3's stack migrates from A to B.]

  • Migrating Stack Data: Isomalloc
    [Diagram: with isomalloc, Thread 3's stack occupies the same reserved virtual-address range on Processor B as it did on Processor A, so its interior pointers remain valid.]

  • Migrating Stack Data
    Isomalloc is a completely automatic solution: no changes needed in the application or compilers. Just like a software shared-memory system, but with proactive paging.
    But it has a few limitations. It depends on having large quantities of virtual address space (best on 64-bit); 32-bit machines can only have a few gigabytes of isomalloc stacks across the whole machine. It also depends on unportable mmap: which addresses are safe? (We must guess!) What about Windows? Or Blue Gene?

  • Aliasing Stack Data
    [Animation over several slides: both processors keep a single fixed virtual-address range for the running thread's stack. To run Thread 2, its stack buffer is mapped there as the "execution copy"; to run Thread 3, Thread 2's stack is unmapped and Thread 3's stack is mapped in its place. Migrating Thread 3 copies its stack buffer to Processor B, where it is mapped at the same fixed address when it runs.]

  • Aliasing Stack Data
    Does not depend on having large quantities of virtual address space; works well on 32-bit machines. Requires only one mmap'd region at a time. Works even on Blue Gene!
    Downsides: a thread context switch requires munmap/mmap (about 3 us), and only one thread can run at a time (so no SMPs!).
    Enabled with the -thread memoryalias link-time option.

  • Heap Data
    Heap data is any dynamically allocated data: C malloc and free; C++ new and delete; F90 ALLOCATE and DEALLOCATE.
    Arrays and linked data structures are almost always heap data.

  • Migrating Heap Data
    Automatic solution: isomalloc all heap data, just like stacks! The -memory isomalloc link option overrides malloc/free; no new application code is needed. Same limitations as isomalloc stacks, plus page-allocation granularity (huge!).
    Manual solution: the application moves its own heap data. It needs to be able to size the message buffer, pack data into the message, and unpack on the other side; the pup abstraction does all three.
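    A minimal sketch of the pup idiom (PUP::er is the real Charm++ abstraction; the Particle class and its fields are illustrative):

      // one pup() routine serves all three jobs: sizing, packing, unpacking
      class Particle {
        double x, y, z;
        int    nNbr;
        int   *nbr;   // heap array, moved manually via pup
      public:
        void pup(PUP::er &p) {
          p | x;  p | y;  p | z;
          p | nNbr;
          if (p.isUnpacking())   // allocate on the receiving side
            nbr = new int[nNbr];
          p(nbr, nNbr);          // size/pack/unpack the array contents
        }
      };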

  • SDAG

  • Structured Dagger
    What is it? A coordination language built on top of Charm++ that expresses control flow in the interface file.
    Motivation: Charm++'s asynchrony is efficient and reliable, but tough to program: split-phase control flow means flags, buffering, out-of-order receives, etc. Threads are easy to program, but less efficient and less reliable (implementation complexity, porting headaches). We want the benefits of both!

  • Structured Dagger Constructs
    when <entry method> {code}: do not continue until the method is called; internally generates flags, checks, etc.; does not use threads.
    atomic {code}: call ordinary sequential C++ code.
    if/else/for/while: C-like control flow.
    overlap {code1 code2 ...}: execute the code segments in parallel.
    forall: parallel do; like a parameterized overlap.
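    A minimal sketch of when, atomic, and overlap in a .ci file (the entry names and Msg type are illustrative):

      // run() continues only after both inputs arrive, in either order
      entry void run() {
        atomic { initialize(); }
        overlap {
          when recvLeft(Msg *l)  atomic { consumeLeft(l); }
          when recvRight(Msg *r) atomic { consumeRight(r); }
        };
        atomic { compute(); }
      };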

  • Stencil Example Using Structured Dagger
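    A hedged sketch of a stencil iteration in SDAG (all names illustrative; not the original slide's code):

      // each iteration waits for both neighbors' ghost data before computing
      entry void stencilLoop() {
        for (iter = 0; iter < maxIter; iter++) {  // iter is a chare member variable
          atomic { sendGhostsToNeighbors(); }
          overlap {
            when recvFromLeft(GhostMsg *m)  atomic { copyGhost(m); }
            when recvFromRight(GhostMsg *m) atomic { copyGhost(m); }
          };
          atomic { computeKernel(); }
        }
      };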

  • Overlap for LeanMD Initialization

  • The for Construct in the LeanMD Timeloop

  • Thank You!
    Free source, binaries, manuals, and more information at http://charm.cs.uiuc.edu/
    Parallel Programming Lab at the University of Illinois

    Speaker notes:

    Message layer: building Charm++ involves selecting both an architecture type and a communication architecture; in most cases Charm++ uses a communication-platform pair to represent the basic compilation configuration.

    Other slide notes: not sorted on priority; special chare array; multicast in MD; wait for reply.

    Isomalloc stacks: a user-level thread, when suspended, consists of a stack and a set of preserved machine registers. During migration, the machine registers are simply copied to the new processor. The stack, unfortunately, is very difficult to move. In a distributed-memory parallel machine, if the stack is moved to a new machine, it will almost undoubtedly be allocated at a different location, so existing pointers to addresses in the original stack would become invalid when the stack moves. We cannot reliably update all the pointers to stack-allocated variables, because these pointers are stored in machine registers and stack frames, whose layout is highly machine- and compiler-dependent.

    PM2 [12] is another migratable thread system; it treats threads as remote procedure calls that return some data on completion. Multithreaded systems such as PM2 [NM96] require every thread to store its state in specially allocated memory, so that the system can migrate the thread automatically.