Advanced Charm++ Tutorial


DESCRIPTION

Advanced Charm++ Tutorial. Charm Workshop Tutorial, Gengbin Zheng, charm.cs.uiuc.edu, 10/19/2005. Topics: building Charm++, advanced messaging, the interface file (.ci), groups, delegation, array multicast, advanced load balancing, threads, and SDAG.

Transcript

Advanced Charm++ Tutorial
Charm Workshop Tutorial
Gengbin Zheng
charm.cs.uiuc.edu
10/19/2005

Outline:
- Building Charm++
- Advanced messaging
- Interface file (.ci)
- Groups
- Delegation
- Array multicast
- Advanced load balancing
- Threads
- SDAG

Charm++ on Parallel Machines
Runs on:
- Any machine with MPI, including IBM SP, Blue Gene/L, Cray XT3, Origin2000, and PSC's Lemieux (Quadrics Elan)
- Clusters with Ethernet (UDP/TCP)
- Clusters with Myrinet (GM)
- Clusters with Ammasso cards
- Apple clusters
- Even Windows!
- SMP-aware (pthreads)

Communication Architecture
The Converse communication API sits above the machine layers:
- Net: UDP (machine-eth.c), TCP (machine-tcp.c), Myrinet (machine-gm.c), Ammasso (machine-ammasso.c)
- MPI
- Elan
- BG/L

Compiling Charm++
./build
Usage: build <target> <version> <options> [charmc-options ...]

<target>: converse charm++ LIBS AMPI FEM bluegene pose jade msa
          doc ps-doc pdf-doc html-doc

<version>:
  bluegenel        mpi-sol          net-sol-amd64
  elan-axp         mpi-sp           net-sol-x86
  elan-linux-ia64  ncube2           net-sun
  exemplar         net-axp          net-win32
  mpi-axp          net-cygwin       origin2000
  mpi-bluegenel    net-hp           origin-pthreads
  mpi-crayx1       net-hp-ia64      paragon-red
  mpi-crayxt3      net-irix         shmem-axp
  mpi-exemplar     net-linux        sim-linux
  mpi-hp-ia64      net-linux-amd64  sp3
  mpi-linux        net-linux-axp    t3e
  mpi-linux-amd64  net-linux-ia64   uth-linux
  mpi-linux-axp    net-ppc-darwin   uth-win32
  mpi-linux-ia64   net-rs6k         vmi-linux
  mpi-origin       net-sol          vmi-linux-ia64
  mpi-ppc-darwin

<options>: compiler and platform specific options
  cc cc64 cxx kcc pgcc acc icc ecc gcc3 mpcc pathscale
  help smp gm tcp vmi scyld clustermatic bluegene ooc syncft papi
  --incdir --libdir --basedir --no-build-shared -j

<charmc-options>: normal compiler options, e.g. -g -O -save -verbose

To get more detailed help, run ./build --help

Build Options
<options>: compiler and platform specific options
  cc cc64 cxx kcc pgcc acc icc ecc gcc3 mpcc pathscale
  help smp gm tcp vmi scyld clustermatic bluegene ooc syncft papi
  --incdir --libdir --basedir --no-build-shared -j

For platform specific options, use the help option, e.g.:
  ./build charm++ net-linux help

Choose a compiler (only one option is allowed from this section):
  cc, cc64      Sun WorkShop C++ 32/64 bit compilers
  cxx           DIGITAL C++ compiler (DEC Alpha)
  kcc           KAI C++ compiler
  pgcc          Portland Group's C++ compiler
  acc           HP aCC compiler
  icc           Intel C/C++ compiler for Linux IA32
  ecc           Intel C/C++ compiler for Linux IA64
  gcc3          use gcc3 - GNU GCC/G++ version 3
  mpcc          Sun Solaris C++ compiler for MPI
  pathscale     use the PathScale compiler suite

Platform specific options (choose multiple if they apply):
  smp           support for SMP, multithreaded Charm on each node
  mpt           use SGI Message Passing Toolkit (only for mpi version)
  gm            use Myrinet for communication
  tcp           use TCP sockets for communication (only for net version)
  vmi           use NCSA's VMI for communication (only for mpi version)
  scyld         compile for Scyld Beowulf cluster based on bproc
  clustermatic  compile for Clustermatic (supports versions 3 and 4)

Advanced options:
  bluegene      compile for BigSim (Blue Gene) simulator
  ooc           compile with out-of-core support
  syncft        compile with Charm++ fault tolerance support
  papi          compile with PAPI performance counter support (if any)

Charm++ dynamic libraries:
  --build-shared     build Charm++ dynamic libraries (.so) (default)
  --no-build-shared  don't build Charm++'s shared libraries

Miscellaneous options:
  --incdir=DIR   specify additional include path for compiler
  --libdir=DIR   specify additional lib path for compiler
  --basedir=DIR  shortcut for the above two - DIR/include and DIR/lib
  -j[N]          parallel make, N is the number of parallel make jobs
  --with-romio   build AMPI with ROMIO library

Build Script
The build script, ./build <target> <version> [charmc-options ...]:
- Creates the directories <version> and <version>/tmp
- Copies src/scripts/Makefile into <version>/tmp
- Does a "make <target> <version> OPTS=<charmc-options>" in <version>/tmp
That's all build does. The rest is handled by the Makefile.

How build works
  build charm++ net-linux gm smp kcc bluegene
- Sorts the options gm, smp, and bluegene
- Mkdir net-linux-bluegene-gm-smp-kcc
- Cats conv-mach-[kcc|bluegene|gm|smp].h into conv-mach-opt.h
- Cats conv-mach-[kcc|bluegene|gm|smp].sh into conv-mach-opt.sh
- Gathers files from net, etc. (Makefile)
- Makes charm++ under net-linux-bluegene-gm-smp-kcc/tmp

How Charmrun Works
  charmrun +p4 ./pgm

Charmrun (batch mode)
  charmrun +p4 ++batch 8
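
The pgm launched by charmrun above is an ordinary Charm++ binary. For orientation, a minimal sketch of such a program, assuming nothing beyond the standard charmc workflow; the module name pgm and the printed text are illustrative, not from the slides:

  // pgm.ci -- interface file
  mainmodule pgm {
    mainchare Main {
      entry Main(CkArgMsg *m);
    };
  };

  // pgm.C -- implementation
  #include "pgm.decl.h"

  class Main : public CBase_Main {
  public:
    Main(CkArgMsg *m) {
      CkPrintf("Hello from %d processors\n", CkNumPes());
      delete m;
      CkExit();   // terminate the parallel run
    }
  };

  #include "pgm.def.h"

This would be built with charmc (charmc pgm.ci; charmc -c pgm.C; charmc -o pgm pgm.o) and launched as shown above.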

Debugging Charm++ Applications
- Printf
- Gdb
  - Sequentially (standalone mode): gdb ./pgm +vp16
  - Attach gdb manually
  - Run the debugger in an xterm:
      charmrun +p4 pgm ++debug
      charmrun +p4 pgm ++debug-no-pause
- Memory paranoid: -memory
- Parallel debugger

How to Become a Charm++ Hacker
Advanced Charm++:
- Advanced messaging
- Interface files (.ci)
- Writing system libraries
- Groups
- Delegation
- Array multicast
- Threads
- SDAG

Advanced Messaging

Prioritized Execution
- The Charm++ scheduler's default is FIFO (oldest message first)
- Prioritized execution: if several messages are available, Charm will process them in the order of their priorities
- Very useful for speculative work, ordering timestamps, etc.

Priority Classes
The Charm++ scheduler has three queues: high, default, and low.
As signed integer priorities:
  -MAXINT .. -1      highest priorities
  0                  default priority
  1 .. +MAXINT       lowest priorities
As unsigned bitvector priorities:
  0x0000 .. 0x7FFF   highest priorities
  0x8000             default priority
  0x8001 .. 0xFFFF   lowest priorities

Prioritized Messages
The number of priority bits is passed during message allocation:
  FooMsg *msg = new (size, nbits) FooMsg;
Priorities are stored at the end of messages.
Signed integer priorities:
  *CkPriorityPtr(msg) = -1;
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);
Unsigned bitvector priorities:
  CkPriorityPtr(msg)[0] = 0x7fffffff;
  CkSetQueueing(msg, CK_QUEUEING_BFIFO);

Prioritized Marshalled Messages
Pass CkEntryOptions as the last parameter.
For signed integer priorities:
  CkEntryOptions opts;
  opts.setPriority(-1);
  fooProxy.bar(x, y, opts);
For bitvector priorities:
  CkEntryOptions opts;
  unsigned int prio[2] = {0x7FFFFFFF, 0xFFFFFFFF};
  opts.setPriority(64, prio);
  fooProxy.bar(x, y, opts);

Advanced Message Features
- Read-only messages: the entry method agrees not to modify or delete the message; avoids a message copy for broadcasts, saving time
- Inline messages: direct invocation if on the local processor
- Expedited messages: the message does not go through the Charm++ scheduler (faster)
- Immediate messages: entries are executed in an interrupt or in the communication thread; very fast, but tough to get right; immediate messages currently work only for NodeGroups and Groups (non-SMP)

Read-Only, Expedited, Immediate
All are declared in the .ci file:
  {
    entry [nokeep]    void foo_readonly(Msg *);
    entry [inline]    void foo_inl(Msg *);
    entry [expedited] void foo_exp(Msg *);
    entry [immediate] void foo_imm(Msg *);
    ...
  };
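
Putting the prioritized-message slides above together, a hedged end-to-end sketch; FooMsg, its varsize data array, and fooProxy.bar are illustrative names, and the (size, nbits) allocation form follows the slide above:

  // Assumes a .ci declaration like:  message FooMsg { int data[]; };
  FooMsg *msg = new (/*data elements*/ 10, /*priority bits*/ 8*sizeof(int)) FooMsg;
  *(int*)CkPriorityPtr(msg) = -5;          // more negative = scheduled earlier
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);   // treat the priority as a signed int
  fooProxy.bar(msg);                       // dequeued ahead of default-priority work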

Interface File (ci)

Interface File Example
  mainmodule hello {
    include "myType.h";
    initnode void myNodeInit();
    initproc void myInit();

    mainchare mymain {
      entry mymain(CkArgMsg *m);
    };

    array [1D] foo {
      entry foo(int problemNo);
      entry void bar1(int x);
      entry void bar2(myType x);
    };
  };

Include and Initcall
- Include: includes an external header file
- Initcall: user plug-in code invoked during Charm++'s startup phase
  - Initnode: called once on every node
  - Initproc: called once on every processor
  - Initnode calls are made before initproc calls

Entry Attributes
- Threaded: the function is invoked in a CthThread
- Sync: blocking methods that can return values as a message; the caller must be a thread (a blocking-call sketch appears at the end of the Threads section below)
- Exclusive: for node groups; does not execute while other exclusive entry methods of its node group are executing on the same node
- Notrace: invisible to trace projections
    entry [notrace] void recvMsg(multicastGrpMsg *m);

Groups/Node Groups

Groups and Node Groups
- Groups: a collection of objects (chares)
  - Also called branch office chares (BOC)
  - Exactly one representative on each processor
  - Ideally suited for system libraries
- Similar to arrays: broadcasts, reductions, indexing
- But not completely like arrays: non-migratable; one per processor
- Node Groups: one per node (SMP)

Declarations
.ci file:
  group mygroup {
    entry mygroup();           // Constructor
    entry void foo(foomsg *);  // Entry method
  };
  nodegroup mynodegroup {
    entry mynodegroup();       // Constructor
    entry void foo(foomsg *);  // Entry method
  };
C++ file:
  class mygroup : public Group {
    mygroup() {}
    void foo(foomsg *m) { CkPrintf("Do Nothing"); }
  };
  class mynodegroup : public NodeGroup {
    mynodegroup() {}
    void foo(foomsg *m) { CkPrintf("Do Nothing"); }
  };

Creating and Calling Groups
Creation:
  p = CProxy_mygroup::ckNew();
Remote invocation:
  p.foo(msg);             // broadcast
  p[1].foo(msg);          // asynchronous
  p.foo(msg, npes, pes);  // list send
Direct local access:
  mygroup *g = p.ckLocalBranch();
  g->foo(...);            // local invocation
Danger: if you migrate, the group stays behind!

Delegation

Delegation
- A customized implementation of messaging
- Enables Charm++ proxy messages to be forwarded to a delegation manager group
- The delegation manager traps calls to proxy sends and applies optimizations
- The delegation manager must inherit from the CkDelegateMgr class
- The user program must call:
    proxy.ckDelegate(mgrID);

Delegation Interface
.ci file:
  group MyDelegateMgr {
    entry MyDelegateMgr();  // Constructor
  };
.h file:
  class MyDelegateMgr : public CkDelegateMgr {
    MyDelegateMgr();
    void ArraySend(..., int ep, void *m, const CkArrayIndexMax &idx, CkArrayID a);
    void ArrayBroadcast(...);
    void ArraySectionSend(..., CkSectionID &s);
    ...
  }
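
Tying the two delegation slides together, a hedged usage sketch; the chare array proxy p, its constructor argument, its size nElems, and the entry bar1 are illustrative, while the ckDelegate call follows the Delegation slide above:

  // Create the delegation manager group, then route a proxy through it.
  CkGroupID mgrID = CProxy_MyDelegateMgr::ckNew();
  CProxy_foo p = CProxy_foo::ckNew(17, nElems);  // ctor arg, array size
  p.ckDelegate(mgrID);   // subsequent sends are trapped by MyDelegateMgr
  p.bar1(7);             // this broadcast now goes through ArrayBroadcast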

Array Multicast

Array Multicast/Reduction Library
- Array section: a subset of a chare array
- Array section creation, by enumerating array indices:
    CkVec<CkArrayIndex1D> elems;   // add array indices
    for (int i = 0; i < 10; i += 2)
      elems.push_back(CkArrayIndex1D(i));
    CProxySection_Hello proxy =
      CProxySection_Hello::ckNew(helloArrayID, elems.getVec(), elems.size());

Array Section Multicast
Once you have the array section proxy:
- Multicast to all the section members:
    CProxySection_Hello proxy;
    proxy.foo(msg);    // multicast
- Send messages to one member using its local index:
    proxy[0].foo(msg);

Array Section Multicast via Delegation
The CkMulticast communication library:
  CProxySection_Hello sectProxy = CProxySection_Hello::ckNew();
  CkGroupID mCastGrpId = CProxy_CkMulticastMgr::ckNew();
  CkMulticastMgr *mcastGrp = CProxy_CkMulticastMgr(mCastGrpId).ckLocalBranch();
  sectProxy.ckSectionDelegate(mCastGrpId);  // initialize the proxy
  sectProxy.foo(...);                       // multicast via delegation
Note: to use the CkMulticast library, all multicast messages must inherit from CkMcastBaseMsg, as follows:
  class HiMsg : public CkMcastBaseMsg, public CMessage_HiMsg {
  public:
    int *data;
  };

Array Section Reduction
Section reduction with delegation, using the default reduction callback:
  CProxySection_Hello sectProxy;
  CkMulticastMgr *mcastGrp = CProxy_CkMulticastMgr(mCastGrpId).ckLocalBranch();
  mcastGrp->setReductionClient(sectProxy, new CkCallback(...));
Reduction:
  CkGetSectionInfo(sid, msg);
  CkCallback cb(CkIndex_myArray::foo(NULL), thisProxy);
  mcastGrp->contribute(sizeof(int), &data, CkReduction::sum_int, sid, cb);

With Migration
- The library works with migration
- When intermediate nodes migrate, the multicast tree is rebuilt automatically
- When the root processor migrates, the application needs to initiate the rebuild (this will become automatic in the future)

Advanced Load-balancers
Writing a Load-balancing Strategy

Advanced load balancing: writing a new strategy
Inherit from CentralLB and implement the work() function:
  class fooLB : public CentralLB {
  public:
    ...
    void work(CentralLB::LDStats* stats, int count);
    ...
  };

LB Database
  struct LDStats {
    ProcStats  *procs;
    LDObjData  *objData;
    LDCommData *commData;
    int        *to_proc;
    // ...
  };

  // Dummy work function which assigns all objects to
  // processor 0 -- don't implement it like this!
  void fooLB::work(CentralLB::LDStats* stats, int count)
  {
    for (int obj = 0; obj < stats->n_objs; obj++)
      stats->to_proc[obj] = 0;
  }
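
By contrast, a minimal sketch of a strategy that actually spreads load: round-robin placement. This assumes count is the number of processors and that LDStats carries the object count n_objs and a per-object migratable flag in LDObjData (fields elided from the struct above):

  void fooLB::work(CentralLB::LDStats* stats, int count)
  {
    for (int obj = 0; obj < stats->n_objs; obj++) {
      if (!stats->objData[obj].migratable)
        continue;                         // leave pinned objects where they are
      stats->to_proc[obj] = obj % count;  // spread objects evenly over processors
    }
  }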

Compiling and Integration
- Edit and run Makefile_lb.sh
  - It creates Make.lb, which is included by the main Makefile
- Run "make depends" to correct dependencies
- Rebuild Charm++; the new balancer is then available as +balancer fooLB

Threads in Charm++

Why use Threads?
- They provide one key feature: blocking
  - Suspend execution (e.g., at a message receive)
  - Do something else
  - Resume later (e.g., after the message arrives)
- Example: MPI_Recv, MPI_Wait semantics
- A function-call interface is more convenient than message passing
  - Regular call/return structure (no CkCallbacks) with complete control flow
  - Allows blocking in the middle of a deeply nested communication subroutine

Why not use Threads?
- Slower
  - Around 1 µs of context-switching overhead is unavoidable
  - Creation/deletion costs perhaps 10 µs
- Migration is more difficult
  - The state of a thread is scattered through its stack, which is maintained by the compiler
  - By contrast, the state of an object is maintained by the user
- These thread disadvantages are the motivation for SDAG (covered later)

Context Switch Cost
[Chart: context switch time (µs) versus number of flows for Process, CthThreads, and Pthreads, measured on the Turing Apple Cluster. CthThreads stay at roughly 1-2 µs across all flow counts, while process and pthread context switches grow to roughly 10-17 µs as the number of flows increases.]
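
As promised under Entry Attributes, a minimal sketch of the blocking style that threaded entries enable. Worker, serverProxy, and ResultMsg are illustrative names, and the .ci declarations shown in the comments are assumptions:

  // .ci (assumed):
  //   entry [threaded] void driver();           // runs in its own CthThread
  //   entry [sync] ResultMsg *compute(int x);   // blocking; returns a message
  void Worker::driver()
  {
    // The [sync] call suspends only this CthThread; the scheduler keeps
    // delivering other messages on this processor while we wait.
    ResultMsg *r = serverProxy[0].compute(42);
    CkPrintf("result: %d\n", r->value);
    delete r;
  }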