
Advanced Charm++ Tutorial


  • Advanced Charm++ Tutorial (Charm Workshop Tutorial)

    Gengbin Zheng, charm.cs.uiuc.edu, 10/19/2005

  • Building Charm++; Advanced messaging; Interface file (.ci); Groups; Delegation; Array multicast; Advanced load balancing; Threads; SDAG

  • Charm++ on Parallel Machines
    Runs on: any machine with MPI, including IBM SP, Blue Gene/L, Cray XT3, Origin2000, and PSC's Lemieux (Quadrics Elan); clusters with Ethernet (UDP/TCP); clusters with Myrinet (GM); clusters with Ammasso cards; Apple clusters; even Windows! SMP-aware (pthreads).

  • Communication Architecture
    Layered design: the Converse communication API sits on top of the machine layers Net, MPI, Elan, and BG/L. Under Net: UDP (machine-eth.c), TCP (machine-tcp.c), Myrinet (machine-gm.c), Ammasso (machine-ammasso.c).

  • Compiling Charm++
    ./build
    Usage: build <target> <version> <options> [charmc-options ...]

    <targets>: converse charm++ LIBS AMPI FEM bluegene pose jade msa
    <doc targets>: doc ps-doc pdf-doc html-doc
    <versions>:
      bluegenel        mpi-sol          net-sol-amd64
      elan-axp         mpi-sp           net-sol-x86
      elan-linux-ia64  ncube2           net-sun
      exemplar         net-axp          net-win32
      mpi-axp          net-cygwin       origin2000
      mpi-bluegenel    net-hp           origin-pthreads
      mpi-crayx1       net-hp-ia64      paragon-red
      mpi-crayxt3      net-irix         shmem-axp
      mpi-exemplar     net-linux        sim-linux
      mpi-hp-ia64      net-linux-amd64  sp3
      mpi-linux        net-linux-axp    t3e
      mpi-linux-amd64  net-linux-ia64   uth-linux
      mpi-linux-axp    net-ppc-darwin   uth-win32
      mpi-linux-ia64   net-rs6k         vmi-linux
      mpi-origin       net-sol          vmi-linux-ia64
      mpi-ppc-darwin

    <options>: compiler and platform specific options
      cc cc64 cxx kcc pgcc acc icc ecc gcc3 mpcc pathscale
      help smp gm tcp vmi scyld clustermatic bluegene ooc syncft papi
      --incdir --libdir --basedir --no-build-shared -j

    <charmc-options>: normal compiler options, e.g. -g -O -save -verbose

    To get more detailed help, run ./build --help

  • Build Options
    <options>: compiler and platform specific options
      cc cc64 cxx kcc pgcc acc icc ecc gcc3 mpcc pathscale
      help smp gm tcp vmi scyld clustermatic bluegene ooc syncft papi
      --incdir --libdir --basedir --no-build-shared -j

    For platform-specific options, use the help option, e.g. ./build charm++ net-linux help

    Choose a compiler (only one option is allowed from this section):
      cc, cc64    Sun WorkShop C++ 32/64-bit compilers
      cxx         DIGITAL C++ compiler (DEC Alpha)
      kcc         KAI C++ compiler
      pgcc        Portland Group's C++ compiler
      acc         HP aCC compiler
      icc         Intel C/C++ compiler for Linux IA32
      ecc         Intel C/C++ compiler for Linux IA64
      gcc3        use gcc3 (GNU GCC/G++ version 3)
      mpcc        Sun Solaris C++ compiler for MPI
      pathscale   use the PathScale compiler suite

    Platform-specific options (choose multiple if they apply):
      smp           support for SMP, multithreaded Charm on each node
      mpt           use SGI Message Passing Toolkit (only for mpi version)
      gm            use Myrinet for communication
      tcp           use TCP sockets for communication (only for net version)
      vmi           use NCSA's VMI for communication (only for mpi version)
      scyld         compile for Scyld Beowulf cluster based on bproc
      clustermatic  compile for Clustermatic (supports versions 3 and 4)

    Advanced options:
      bluegene  compile for the BigSim (Blue Gene) simulator
      ooc       compile with out-of-core support
      syncft    compile with Charm++ fault-tolerance support
      papi      compile with PAPI performance counter support (if any)

    Charm++ dynamic libraries:
      --build-shared     build Charm++ dynamic libraries (.so) (default)
      --no-build-shared  don't build Charm++'s shared libraries

    Miscellaneous options:
      --incdir=DIR   specify additional include path for the compiler
      --libdir=DIR   specify additional lib path for the compiler
      --basedir=DIR  shortcut for the above two: DIR/include and DIR/lib
      -j[N]          parallel make; N is the number of parallel make jobs
      --with-romio   build AMPI with the ROMIO library

  • Build Script
    The build script does:
      ./build <target> <version> <options> [charmc-options ...]
    Creates the directories <version> and <version>/tmp
    Copies src/scripts/Makefile into <version>/tmp
    Does a "make <target> <version> OPTS=<charmc-options>" in <version>/tmp
    That's all build does. The rest is handled by the Makefile.

  • How Build Works
    build charm++ net-linux gm smp kcc bluegene
    Sorts gm, smp, and bluegene
    Mkdir net-linux-bluegene-gm-smp-kcc
    Cats conv-mach-[kcc|bluegene|gm|smp].h into conv-mach-opt.h
    Cats conv-mach-[kcc|bluegene|gm|smp].sh into conv-mach-opt.sh
    Gathers files from net, etc. (Makefile)
    Makes charm++ under net-linux-bluegene-gm-smp-kcc/tmp

  • How Charmrun Works
    charmrun +p4 ./pgm

  • Charmrun (batch mode)
    charmrun +p4 ++batch 8

  • Debugging Charm++ Applications
    Printf
    Gdb: sequentially (standalone mode), gdb ./pgm +vp16; or attach gdb manually
    Run the debugger in an xterm: charmrun +p4 pgm ++debug, or charmrun +p4 pgm ++debug-no-pause
    Memory paranoia: -memory paranoid
    Parallel debugger

  • How to Become a Charm++ Hacker
    Advanced Charm++: advanced messaging; interface files (.ci); writing system libraries; groups; delegation; array multicast; threads; SDAG

  • Advanced Messaging

  • Prioritized Execution
    The Charm++ scheduler defaults to FIFO (oldest message first). With prioritized execution, if several messages are available, Charm++ processes them in the order of their priorities. Very useful for speculative work, ordering timestamps, etc.

  • Priority Classes
    The Charm++ scheduler has three queues: high, default, and low.
    As signed integer priorities: -MAXINT (highest) ... 0 (default) ... +MAXINT (lowest).
    As unsigned bitvector priorities: 0x0000 (highest) ... 0x8000 (default) ... 0xFFFF (lowest).

  • Prioritized Messages
    The number of priority bits is passed during message allocation:
      FooMsg *msg = new (size, nbits) FooMsg;
    Priorities are stored at the end of the message.

    Signed integer priorities:
      *(int*)CkPriorityPtr(msg) = -1;
      CkSetQueueing(msg, CK_QUEUEING_IFIFO);
    Unsigned bitvector priorities:
      ((unsigned int*)CkPriorityPtr(msg))[0] = 0x7fffffff;
      CkSetQueueing(msg, CK_QUEUEING_BFIFO);
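    Putting the pieces together, a minimal sketch of allocating, prioritizing, and sending such a message (FooMsg, size, and fooProxy are illustrative names, not from the Charm++ API):

      // allocate with room for one signed-integer priority (nbits = 8*sizeof(int))
      FooMsg *msg = new (size, 8*sizeof(int)) FooMsg;
      *(int*)CkPriorityPtr(msg) = -5;         // more negative = higher priority
      CkSetQueueing(msg, CK_QUEUEING_IFIFO);  // interpret priority as a signed int
      fooProxy.bar(msg);                      // scheduler delivers in priority order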

  • Prioritized Marshalled Messages
    Pass a CkEntryOptions object as the last parameter.

    For signed integer priorities:
      CkEntryOptions opts;
      opts.setPriority(-1);
      fooProxy.bar(x, y, opts);

    For bitvector priorities:
      CkEntryOptions opts;
      unsigned int prio[2] = {0x7FFFFFFF, 0xFFFFFFFF};
      opts.setPriority(64, prio);
      fooProxy.bar(x, y, opts);

  • Advanced Message Features
    Read-only messages: the entry method agrees not to modify or delete the message; avoids a message copy for broadcasts, saving time.
    Inline messages: direct invocation if on the local processor.
    Expedited messages: messages do not go through the Charm++ scheduler (faster).
    Immediate messages: entries are executed in an interrupt or on the communication thread; very fast, but tough to get right. Immediate messages currently work only for NodeGroups and Groups (non-SMP).

  • Read-Only, Expedited, Immediate
    All are declared in the .ci file:
      entry [nokeep]    void foo_readonly(Msg *);
      entry [inline]    void foo_inl(Msg *);
      entry [expedited] void foo_exp(Msg *);
      entry [immediate] void foo_imm(Msg *);

  • Interface File (ci)

  • Interface File Example
      mainmodule hello {
        include "myType.h";
        initnode void myNodeInit();
        initproc void myInit();

        mainchare mymain {
          entry mymain(CkArgMsg *m);
        };

        array [1D] foo {
          entry foo(int problemNo);
          entry void bar1(int x);
          entry void bar2(myType x);
        };
      };

  • Include and Initcall
    Include: includes an external header file.
    Initcall: user plug-in code to be invoked in Charm++'s startup phase.
      Initnode: called once on every node.
      Initproc: called once on every processor.
      Initnode calls are made before initproc calls.

  • Entry Attributes
    threaded: the function is invoked in a CthThread.
    sync: blocking methods that can return values as a message; the caller must be a thread.
    exclusive: for node groups; does not execute while other exclusive entry methods of its node group are executing on the same node.
    notrace: invisible to trace projections, e.g.
      entry [notrace] void recvMsg(multicastGrpMsg *m);
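    As a minimal sketch of how threaded and sync combine (the Worker chare, workerProxy, and ResultMsg are illustrative names, not from the source):

      // .ci:  entry [sync] ResultMsg *compute(int n);
      //       entry [threaded] void driver();
      void Worker::driver() {   // runs in its own CthThread
        // the [sync] call blocks this thread (not the processor) until a reply arrives
        ResultMsg *r = workerProxy[1].compute(42);
        CkPrintf("result: %d\n", r->value);
        delete r;
      }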

  • Groups/Node Groups

  • Groups and Node Groups
    Groups: a collection of objects (chares), also called branch-office chares (BOC). Exactly one representative on each processor; ideally suited for system libraries.
    Similar to arrays: broadcasts, reductions, indexing. But not completely like arrays: non-migratable; one per processor.
    Node groups: one per node (SMP).

  • Declarations
    .ci file:
      group mygroup {
        entry mygroup();          // constructor
        entry void foo(foomsg *); // entry method
      };
      nodegroup mynodegroup {
        entry mynodegroup();      // constructor
        entry void foo(foomsg *); // entry method
      };
    C++ file:
      class mygroup : public Group {
        mygroup() {}
        void foo(foomsg *m) { CkPrintf("Do Nothing"); }
      };
      class mynodegroup : public NodeGroup {
        mynodegroup() {}
        void foo(foomsg *m) { CkPrintf("Do Nothing"); }
      };

  • Creating and Calling Groups
    Creation:
      p = CProxy_mygroup::ckNew();
    Remote invocation:
      p.foo(msg);            // broadcast
      p[1].foo(msg);         // asynchronous
      p.foo(msg, npes, pes); // list send
    Direct local access:
      mygroup *g = p.ckLocalBranch();
      g->foo(...);           // local invocation
    Danger: if you migrate, the group stays behind!

  • Delegation

  • Delegation
    A customized implementation of messaging: it enables Charm++ proxy messages to be forwarded to a delegation manager group.
    The delegation manager traps calls to proxy sends and applies optimizations. It must inherit from the CkDelegateMgr class.
    The user program must call proxy.ckDelegate(mgrID);

  • Delegation Interface
    .ci file:
      group MyDelegateMgr {
        entry MyDelegateMgr(); // constructor
      };
    .h file:
      class MyDelegateMgr : public CkDelegateMgr {
        MyDelegateMgr();
        void ArraySend(..., int ep, void *m, const CkArrayIndexMax &idx, CkArrayID a);
        void ArrayBroadcast(...);
        void ArraySectionSend(..., CkSectionID &s);
        ...
      };
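    Hooking the manager up might look like the following sketch (arrayProxy and someEntry are illustrative; the slides only show the ckDelegate call itself):

      // create the delegation manager group and route a proxy's sends through it
      CkGroupID mgrID = CProxy_MyDelegateMgr::ckNew();
      arrayProxy.ckDelegate(mgrID);  // subsequent sends are trapped by MyDelegateMgr
      arrayProxy.someEntry(msg);     // dispatched via ArraySend/ArrayBroadcast above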

  • Array Multicast

  • Array Multicast/Reduction Library
    An array section is a subset of a chare array.
    Array section creation: enumerate array indices, e.g. a section over the even-numbered elements:
      CkVec<CkArrayIndex1D> elems;  // add array indices
      for (int i = 0; i < 10; i += 2)
        elems.push_back(CkArrayIndex1D(i));
      CProxySection_Hello proxy =
          CProxySection_Hello::ckNew(helloArrayID, elems.getVec(), elems.size());
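    The same section can also be built with the range form of ckNew; a minimal sketch (helloArrayID as above):

      // section over elements 0, 2, 4, 6, 8: lower bound, upper bound, stride
      CProxySection_Hello proxy = CProxySection_Hello::ckNew(helloArrayID, 0, 8, 2);
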
  • Array Section Multicast
    Once you have the array-section proxy, multicast to all the section members:
      CProxySection_Hello proxy;
      proxy.foo(msg);  // multicast
    Send messages to one member using its local index:
      proxy[0].foo(msg);

  • Array Section Multicast
    Multicast via delegation: the CkMulticast communication library.
      CProxySection_Hello sectProxy = CProxySection_Hello::ckNew();
      CkGroupID mCastGrpId = CProxy_CkMulticastMgr::ckNew();
      CkMulticastMgr *mcastGrp = CProxy_CkMulticastMgr(mCastGrpId).ckLocalBranch();
      sectProxy.ckSectionDelegate(mCastGrpId); // initialize the proxy
      sectProxy.foo(...);                      // multicast via delegation
    Note: to use the CkMulticast library, all multicast messages must inherit from CkMcastBaseMsg, as follows:
      class HiMsg : public CkMcastBaseMsg, public CMessage_HiMsg {
      public:
        int *data;
      };

  • Array Section Reduction
    Section reduction with delegation, using the default reduction callback:
      CProxySection_Hello sectProxy;
      CkMulticastMgr *mcastGrp = CProxy_CkMulticastMgr(mCastGrpId).ckLocalBranch();
      mcastGrp->setReductionClient(sectProxy, new CkCallback(...));
    Contributing to the reduction:
      CkGetSectionInfo(sid, msg);
      CkCallback cb(CkIndex_myArray::foo(NULL), thisProxy);
      mcastGrp->contribute(sizeof(int), &data, CkReduction::sum_int, sid, cb);

  • With Migration
    Works with migration. When intermediate nodes migrate, the multicast tree is rebuilt automatically. When the root processor migrates, the application needs to initiate the rebuild; this will become automatic in the future.

  • Advanced Load-balancers

    Writing a Load-balancing Strategy

  • Advanced Load Balancing: Writing a New Strategy
    Inherit from CentralLB and implement the work() function:
      class fooLB : public CentralLB {
      public:
        ...
        void work(CentralLB::LDStats *stats, int count);
        ...
      };

  • LB Database
      struct LDStats {
        ProcStats  *procs;
        LDObjData  *objData;
        LDCommData *commData;
        int        *to_proc;
        // ...
      };

    A dummy work function which assigns all objects to processor 0 (don't implement it like this!):
      void fooLB::work(CentralLB::LDStats *stats, int count) {
        for (int obj = 0; obj < stats->n_objs; obj++)
          stats->to_proc[obj] = 0;
      }
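    By contrast, a strategy that does something marginally useful might scatter objects round-robin; a minimal sketch, assuming the LDStats fields shown above plus a per-object migratable flag:

      // assign migratable objects round-robin across the count processors
      void fooLB::work(CentralLB::LDStats *stats, int count) {
        int next = 0;
        for (int obj = 0; obj < stats->n_objs; obj++) {
          if (!stats->objData[obj].migratable)
            continue;                  // leave pinned objects where they are
          stats->to_proc[obj] = next;  // new home processor for this object
          next = (next + 1) % count;   // count = number of processors
        }
      }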

  • Compiling and Integration
    Edit and run Makefile_lb.sh. It creates Make.lb, which is included by the main Makefile.
    Run make depends to correct dependencies.
    Rebuild Charm++; the new balancer is then available via +balancer fooLB.

  • Threads in Charm++

  • Why Use Threads?
    They provide one key feature: blocking. Suspend execution (e.g., at a message receive), do something else, and resume later (e.g., after the message arrives). Example: MPI_Recv, MPI_Wait semantics.
    A function-call interface is more convenient than message passing: a regular call/return structure (no CkCallbacks) with complete control flow, which allows blocking in the middle of a deeply nested communication subroutine.

  • Why Not Use Threads?
    Slower: around 1 us of context-switching overhead is unavoidable; creation/deletion costs perhaps 10 us.
    Migration is more difficult: the state of a thread is scattered through its stack, which is maintained by the compiler; by contrast, the state of an object is maintained by the user.
    These thread disadvantages are the motivation for SDAG (later).

  • Context Switch Cost
    [Chart: "Context Switching Time (Turing Apple Cluster)": context switch time (us) vs. number of flows, comparing processes, CthThreads, and pthreads. CthThreads stay at roughly 0.8-2 us from 1 up to 15000 flows, while processes and pthreads climb past 10 us by a few hundred flows.]

  • What Are (Converse) Threads?
    One flow of control (instruction stream): machine registers, a program counter, and an execution stack.
    Like pthreads (kernel threads), except that they are implemented at user level (in Converse), scheduled at user level (non-preemptive), and migratable between nodes.

  • How Do I Use Threads?
    Many options:
      AMPI: always uses threads via the TCharm library
      Charm++: [threaded] entry methods run in a thread; [sync] methods
      Converse: C routines CthCreate/CthSuspend/CthAwaken; everything else is built on these
    Implemented using SYSV makecontext/setcontext, POSIX setjmp/alloca/longjmp, or assembly code.

  • How Do I Use Threads (Example)
    Blocking API routine: find an array element
      int requestFoo(int src) {
        myObject *obj = ...;
        return obj->fooRequest(src);
      }
    Send the request and suspend
      int myObject::fooRequest(int src) {
        proxy[dest].fooNetworkRequest(thisIndex);
        stashed_thread = CthSelf();
        CthSuspend();  // -- blocks until awaken call --
        return stashed_return;
      }
    Awaken the thread when data arrives
      void myObject::fooNetworkResponse(int ret) {
        stashed_return = ret;
        CthAwaken(stashed_thread);
      }

  • How Do I Use Threads (Example)
    Send request, suspend, receive, awaken, return:
      int myObject::fooRequest(int src) {
        proxy[dest].fooNetworkRequest(thisIndex);
        stashed_thread = CthSelf();
        CthSuspend();
        return stashed_return;
      }
      void myObject::fooNetworkResponse(int ret) {
        stashed_return = ret;
        CthAwaken(stashed_thread);
      }

  • Thread Migration

  • Stack Data
    The stack is used by the compiler to track function calls and provide temporary storage: local variables, subroutine parameters, C alloca storage.
    Most of the variables in a typical application are stack data.
    The stack is allocated by the Charm++ runtime as heap memory (the +stacksize option).

  • Migrating Stack Data
    Without compiler support, we cannot change a stack's address, because we can't update the stack's interior pointers (return frame pointers, function arguments, etc.): existing pointers to addresses in the original stack would become invalid.
    Solution: isomalloc addresses. Reserve address space on every processor for every thread stack, and use mmap to scatter stacks in virtual memory efficiently. The idea comes from PM2.

  • Migrating Stack Data
    [Diagram: Processor A's memory and Processor B's memory, each spanning 0x00000000-0xFFFFFFFF with code, globals, heap, and thread stacks. Thread 3's stack migrates from A to B.]

  • Migrating Stack Data: Isomalloc
    [Diagram: with isomalloc, Thread 3's stack occupies the same reserved virtual-address range on Processor B as it did on Processor A, so its interior pointers remain valid.]

  • Migrating Stack Data
    Isomalloc is a completely automatic solution: no changes needed in the application or compilers. Just like a software shared-memory system, but with proactive paging.
    But it has a few limitations. It depends on having large quantities of virtual address space (best on 64-bit); 32-bit machines can only have a few gigabytes of isomalloc stacks across the whole machine. It also depends on unportable mmap: which addresses are safe? (We must guess!) What about Windows? Or Blue Gene?

  • Aliasing Stack Data
    [Animation over several slides: both processors keep a single fixed virtual-address range for the running thread's stack. To run Thread 2, its stack buffer is mapped there as the "execution copy"; to run Thread 3, Thread 2's stack is unmapped and Thread 3's stack is mapped in its place. Migrating Thread 3 copies its stack buffer to Processor B, where it is mapped at the same fixed address when it runs.]

  • Aliasing Stack Data
    Does not depend on having large quantities of virtual address space; works well on 32-bit machines. Requires only one mmap'd region at a time. Works even on Blue Gene!
    Downsides: a thread context switch requires munmap/mmap (about 3 us), and only one thread can run at a time (so no SMPs!).
    Enabled with the -thread memoryalias link-time option.

  • Heap Data
    Heap data is any dynamically allocated data: C malloc and free; C++ new and delete; F90 ALLOCATE and DEALLOCATE.
    Arrays and linked data structures are almost always heap data.

  • Migrating Heap Data
    Automatic solution: isomalloc all heap data, just like stacks! The -memory isomalloc link option overrides malloc/free; no new application code is needed. Same limitations as isomalloc stacks, plus page-allocation granularity (huge!).
    Manual solution: the application moves its own heap data. It needs to be able to size the message buffer, pack data into the message, and unpack on the other side; the pup abstraction does all three.
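    A minimal sketch of the pup idiom (PUP::er is the real Charm++ abstraction; the Particle class and its fields are illustrative):

      // one pup() routine serves all three jobs: sizing, packing, unpacking
      class Particle {
        double x, y, z;
        int    nNbr;
        int   *nbr;   // heap array, moved manually via pup
      public:
        void pup(PUP::er &p) {
          p | x;  p | y;  p | z;
          p | nNbr;
          if (p.isUnpacking())   // allocate on the receiving side
            nbr = new int[nNbr];
          p(nbr, nNbr);          // size/pack/unpack the array contents
        }
      };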

  • SDAG

  • Structured Dagger
    What is it? A coordination language built on top of Charm++ that expresses control flow in the interface file.
    Motivation: Charm++'s asynchrony is efficient and reliable, but tough to program: split-phase control flow means flags, buffering, out-of-order receives, etc. Threads are easy to program, but less efficient and less reliable (implementation complexity, porting headaches). We want the benefits of both!

  • Structured Dagger Constructs
    when <entry method> {code}: do not continue until the method is called; internally generates flags, checks, etc.; does not use threads.
    atomic {code}: call ordinary sequential C++ code.
    if/else/for/while: C-like control flow.
    overlap {code1 code2 ...}: execute the code segments in parallel.
    forall: parallel do; like a parameterized overlap.
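    A minimal sketch of when, atomic, and overlap in a .ci file (the entry names and Msg type are illustrative):

      // run() continues only after both inputs arrive, in either order
      entry void run() {
        atomic { initialize(); }
        overlap {
          when recvLeft(Msg *l)  atomic { consumeLeft(l); }
          when recvRight(Msg *r) atomic { consumeRight(r); }
        };
        atomic { compute(); }
      };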

  • Stencil Example Using Structured Dagger
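    A hedged sketch of a stencil iteration in SDAG (all names illustrative; not the original slide's code):

      // each iteration waits for both neighbors' ghost data before computing
      entry void stencilLoop() {
        for (iter = 0; iter < maxIter; iter++) {  // iter is a chare member variable
          atomic { sendGhostsToNeighbors(); }
          overlap {
            when recvFromLeft(GhostMsg *m)  atomic { copyGhost(m); }
            when recvFromRight(GhostMsg *m) atomic { copyGhost(m); }
          };
          atomic { computeKernel(); }
        }
      };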

  • Overlap for LeanMD Initialization

  • The for Construct in the LeanMD Timeloop

  • Thank You!
    Free source, binaries, manuals, and more information at http://charm.cs.uiuc.edu/
    Parallel Programming Lab at the University of Illinois

    Speaker notes:

    Message layer: building Charm++ involves selecting both an architecture type and a communication architecture; in most cases Charm++ uses a communication-platform pair to represent the basic compilation configuration.

    Other slide notes: not sorted on priority; special chare array; multicast in MD; wait for reply.

    Isomalloc stacks: a user-level thread, when suspended, consists of a stack and a set of preserved machine registers. During migration, the machine registers are simply copied to the new processor. The stack, unfortunately, is very difficult to move. In a distributed-memory parallel machine, if the stack is moved to a new machine, it will almost undoubtedly be allocated at a different location, so existing pointers to addresses in the original stack would become invalid when the stack moves. We cannot reliably update all the pointers to stack-allocated variables, because these pointers are stored in machine registers and stack frames, whose layout is highly machine- and compiler-dependent.

    PM2 [12] is another migratable thread system; it treats threads as remote procedure calls that return some data on completion. Multithreaded systems such as PM2 [NM96] require every thread to store its state in specially allocated memory, so that the system can migrate the thread automatically.