ISC14 BoF: Drilling Down: Understanding User-Level Activity on Today's Supercomputers


Slide 1
ISC14 BoF: Drilling Down: Understanding User-Level Activity on Today's Supercomputers

Slide 2
Outline
  • Brief presentation
  • Open discussion
  • Demo

Slide 3
Robert McLay
TACC

Slide 4
My Passions
  • Protect the new user but stay out of the vet's way
  • Make staff support efficient and effective
  • Automate detection, correction, prevention
  • Make the repeat tickets go away!

Slide 5
Making a difference
Maintain consistent, compatible software environment

    $ module swap mvapich2 impi
    Inactive Modules:
      1) vasp
    Due to MODULEPATH changes the following have been reloaded:
      1) fftw3/3.3.2

    $ module load mvapich2
    Lmod Error: You can only have one MPI module loaded at a time.
    You already have impi loaded.

Lmod and related tools

Slide 6
Making a difference
Detect potential problems and alert users

    TACC: Starting up job 423224
    ******************************************************
    WARNING: Your MPI Environment is: mvapich2/1.9a2
             Your executable was built with: impi/4.1.0.030
    ******************************************************

Lariat and related tools

Slide 7
Making a difference
Job-level usage data on libraries and applications
ALTD (Mark Fahey -- NICS)

Slide 8
Joining forces
  • Detect potential problems and alert users (Lariat)
  • Job-level usage data on libraries and applications (ALTD)
  • XALT

    TACC: Starting up job 423224
    ******************************************************
    WARNING: Your MPI Environment is: mvapich2/1.9a2
             Your executable was built with: impi/4.1.0.030
    ******************************************************

Slide 9
My own not-so-hidden agenda...
  • Looking for XALT beta users
  • Hungry for ideas, needs, feedback
  • Wanting to begin conversation with kindred souls

Slide 10
Mark Fahey
UTK

Slide 11
ALTD
What it does
  • Intercepts linker (ld) and job launcher (aprun)
  • Uses linker tracemap option to get all libraries
  • Stores all of this in a database
What it gets
  • Full path of the executable
  • Static and dynamic libraries used by the executable
What it can be used for
  • Which executables use the largest number of core hours? Are they managed by the center? Do they use the system efficiently?
  • Which libraries, applications, or tools are being used?
  • Are there libraries we should remove? Are there libraries we should install?
  • What percentage of executables are scripts? Are these scripts being used because the job starter isn't sophisticated enough?
  • Are there any executables with modification times older than 1 year? Should we ask the user to recompile?

Slide 12
What Does NERSC Collect?
  • ALTD: track library usage both at compile and run time
  • Torque logs: job information, accounting
  • ALPS logs: track application run-time data and options on the Cray systems
  • Darshan: IO profiling data
  • IPM: MPI profiling data
  • Performance monitoring: monitoring system performance over the lifetime of the machines
  • LMT: Lustre data

Slide 13
ALTD is enabled on all major computing platforms at NERSC

Slide 14
Applications of ALTD
  • Understanding current library usage and planning for future software needs
  • Providing usage statistics to developers and vendors
  • Restoring the program environment where user applications were built
  • Assisting with debugging system issues
An ALTD tool to restore the build environment for an application:

    aryal@edison12:~> linkinfo.sh /global/homes/a/aryal/bin/gvasp5.3.2
    User           : zz217
    Linked on      : 2013-01-03
    Executable Name: vasp
    Libraries Used :
    //usr/lib64/libhugetlbfs.a
    ../vasp.5.lib/libdmy.a
    /opt/cray/atp/1.6.0/lib//libAtpSigHCommData.a
    /opt/cray/atp/1.6.0/lib//libAtpSigHandler.a
    /opt/cray/libsci/12.0.00/cray/81/sandybridge/lib/libsci_cray_mp.a
    /opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a
    /opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpich_cray.a
    /opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpl.a
    /opt/cray/xpmem/0.1-2.0500.36799.3.6.ari/lib64/libxpmem.a
    /opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
    /opt/cray/ugni/4.0-1.0500.5836.7.58.ari/lib64/libugni.a
    /opt/cray/udreg/2.3.2-1.0500.5931.3.1.ari/lib64/libudreg.a
    /opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpslli.a
    /opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpsutil.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libpgas-dmapp.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
    /opt/cray/dmapp/4.0.1-1.0500.5932.6.5.ari/lib64/libdmapp.a
    /opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libfi.a
    /opt/gcc/4.4.4/snos/lib64/libstdc++.a
    /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libf.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libcraymath.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libcraymp.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libcsup.a
    //usr/lib64/librt.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libtcmalloc_minimal.a
    //usr/lib64/libpthread.a
    //usr/lib64/libc.a
    /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
    //usr/lib64/libm.a
    /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc.a

Slide 15
ALTD at CSCS
  • In production at CSCS since 2011
  • Rock solid: just a single downtime in two years
  • Rosa (Cray XE6) since March 2011: 600K compilations, 2.8M jobs
  • Todi (Cray XK6/XK7) since October 2012: 470K compilations, 500K jobs
  • Daint (Cray XC30) since March 2013: 100K compilations, 550K jobs
  • We've added an additional SQL table, "accounting", which logs more data about the application execution: number of cores used, number of cores claimed, number of threads, MPI processes, processes per node, ...
  • We want to be able to detect situations like the use of a buggy or non-performant library

Slide 16
How we mine data: a hypothetical situation
A critical bug has been identified in FFTW version 3.3.0.2, affecting code correctness

Slide 17
First, find which users have linked this library

    mysql> select distinct username
           from altd_rosa_link_tags, altd_rosa_linkline
           where altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
             and exit_code = 0
             and linkline like '%fftw/3.3.0.2/%';
    +----------+
    | username |
    +----------+
    | tkachenn |
    | boswald  |
    | liang    |
    | robinson |
    | yunding  |
    | zilia    |
    +----------+
    5 rows in set (4.33 sec)

Querying the ALTD database reveals that several users have applications linked to the buggy library

Slide 18
Now, check if they are using the buggy application

    mysql> select altd_rosa_jobs.*
           from altd_rosa_link_tags, altd_rosa_linkline, altd_rosa_jobs
           where altd_rosa_jobs.tag_id = altd_rosa_link_tags.tag_id
             and altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
             and exit_code = 0
             and linkline like '%fftw/3.3.0.2/%'
             and altd_rosa_jobs.username = "robinson";
    +---------+--------+------------------------+----------+------------+--------+---------------+
    | run_inc | tag_id | executable             | username | run_date   | job_id | build_machine |
    +---------+--------+------------------------+----------+------------+--------+---------------+
    | 2410158 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    | 2410172 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    | 2410198 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    | 2410222 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    +---------+--------+------------------------+----------+------------+--------+---------------+
    4 rows in set (0.65 sec)

  • And it's confirmed that user robinson is running the application linked to the buggy library
  • It's now up to the user services group to contact the user and recommend relinking their applications against the newer version of FFTW, which has fixed the bug
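
The two manual queries on Slides 17 and 18 could plausibly be folded into a single step that lists every user who has actually run an executable linked against the flagged library. The following is only a sketch under stated assumptions: it uses an in-memory SQLite database as a stand-in for the MySQL server, the table and column names are taken from the slide queries, and the reduced schema (including where exit_code and linkline live), the users alice and bob, and all sample rows are hypothetical.

```python
import sqlite3

# Hypothetical, reduced schema: only columns that appear in the slide queries.
# Assumption: exit_code and linkline both live in altd_rosa_linkline.
SCHEMA = """
CREATE TABLE altd_rosa_linkline (linking_inc INTEGER PRIMARY KEY, linkline TEXT, exit_code INTEGER);
CREATE TABLE altd_rosa_link_tags (tag_id INTEGER PRIMARY KEY, linkline_id INTEGER, username TEXT);
CREATE TABLE altd_rosa_jobs (run_inc INTEGER PRIMARY KEY, tag_id INTEGER,
                             executable TEXT, username TEXT, run_date TEXT);
"""

# One query instead of two: join link records to job records and group per user.
QUERY = """
SELECT j.username, j.executable, COUNT(*) AS runs, MAX(j.run_date) AS last_run
FROM altd_rosa_jobs AS j
JOIN altd_rosa_link_tags AS t ON j.tag_id = t.tag_id
JOIN altd_rosa_linkline AS l ON t.linkline_id = l.linking_inc
WHERE l.exit_code = 0 AND l.linkline LIKE ?
GROUP BY j.username, j.executable
ORDER BY runs DESC
"""


def users_running(conn, pattern):
    """Return (username, executable, runs, last_run) for link lines matching pattern."""
    return conn.execute(QUERY, (pattern,)).fetchall()


def demo_db():
    """Build a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO altd_rosa_linkline VALUES (?, ?, ?)", [
        (1, "ld -o mycode main.o /opt/fftw/3.3.0.2/lib/libfftw3.a", 0),
        (2, "ld -o other main.o /opt/fftw/3.3.0.4/lib/libfftw3.a", 0),
        (3, "ld -o unused main.o /opt/fftw/3.3.0.2/lib/libfftw3.a", 0),
    ])
    conn.executemany("INSERT INTO altd_rosa_link_tags VALUES (?, ?, ?)", [
        (10, 1, "alice"), (11, 2, "alice"), (12, 3, "bob"),
    ])
    conn.executemany("INSERT INTO altd_rosa_jobs VALUES (?, ?, ?, ?, ?)", [
        (100, 10, "/users/alice/mycode", "alice", "2013-11-04"),
        (101, 10, "/users/alice/mycode", "alice", "2013-11-05"),
        (102, 11, "/users/alice/other", "alice", "2013-11-05"),
    ])
    return conn


if __name__ == "__main__":
    for row in users_running(demo_db(), "%fftw/3.3.0.2/%"):
        print(row)
```

In this toy data, bob linked against the flagged version but never ran the result, so only alice's executable is reported. Whether a real ALTD schema supports exactly this join is something to check against the deployed tables.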

Slide 19
This methodology is clearly unmanageable!
Ideally, user support specialists would be alerted automatically to situations of interest
  • Users running applications linked to legacy, less-performant, or buggy libraries
  • Users running legacy versions of applications
  • Users building code with legacy compilers
  • Users making use of their own libs or apps, when more optimized versions are available centrally
How can we automate the processes of data mining, reporting and alerting?

Slide 20
Porting ALTD on IBM x86 and BGP
  • Neser (x86): ALTD worked instantly for the linking stage (Intel and GNU compilers)
      - Tracking of the executable temporarily postponed (several OpenMPI versions: 1.4.3, 1.5.4, 1.6.4, 1.6.5, 1.7.3, 1.8.1) [x2 for Intel and GNU]
  • More effort needed for BGP
      - On this machine at least 4 sets of compilers are available: native IBM or GNU, for the front end (64 bits) or core nodes (32 bits)
      - Each of them has its dedicated linker
      - Patching/wrapping only ld and mpirun was no longer possible or sufficient
      - XL_LINKER environment available (forces the linker to use ALTD, as for the Cray compilers)
            XL_LINKER F ALTD_PATH+'/alias/dev/xlf_linker/xlf_xx._xx.cfg
      - Need to specify the appropriate ld with the corresponding .cfg file
ISC14: BoF, ALTD at KAUST Supercomputing Lab

Slide 21
New implementation of ALTD based on bash aliases
  • Each alias points to a unique Python script, $ALTD_ENV/rerouting/altd_alias.py
  • An analysis of the command line gives the name of the original command along with some interesting parameters (name of the compiler, executable or job file, requested number of processors)
  • Depending on this information, it may execute commands to add new information to the database, modify the running environment, or patch a job file before submitting it
  • It then runs the same command line, this time in an environment unaware of the aliases
  • In case of error, the original command runs in an environment unaware of the aliases. The error is logged in a file and a mail is sent to the ALTD maintainers. It gathers the encountered exception or error as well as extensive information on the user's environment at that time
  • Known issue: threaded IBM compilers (mpixf90_r, mpixlc_r)
      - Error at runtime about "Insufficient memory to start application"
      - Disabled aliases for mpi._r compilers

Slide 22
ALTD: early mining (2 months in full production)
  • ALTD fully ported for x86 and BGP
  • Identified: few compiled codes (our users are mostly black-box users or in production)
  • Enabled storing jobs even if there is no ALTD header (we store tag_id=0)
  • Linking: many users linked with libraries built in their home directories
  • Execution
      - Climate modeling (mainly pre-processing)
      - Combustion (S3D, NGA)
      - Molecular dynamics (VASP, NAMD, ...)
  • Ongoing
      - Relink major software installed by the staff to track it better
      - Additional tracking of other statistics (number of requested nodes vs. actually used)
      - Work on a web interface for automatic mining
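
The alias-based rerouting described on Slide 21 can be pictured roughly as follows. This is a minimal sketch and not the KAUST altd_alias.py script: the function names, the flags it looks for, the ALTD_ variable prefix, and the /opt/altd/alias wrapper directory are all assumptions made for illustration, and the mail notification is reduced to a log entry.

```python
import logging
import os
import subprocess
import sys

log = logging.getLogger("altd_alias_sketch")


def analyze(argv):
    """Extract a few parameters from a compiler or launcher command line."""
    info = {"command": os.path.basename(argv[0]), "output": None, "nprocs": None}
    args = argv[1:]
    # Walk over (flag, value) pairs; only a couple of common flags are handled.
    for flag, value in zip(args, args[1:]):
        if flag == "-o":
            info["output"] = value
        elif flag in ("-n", "-np"):
            info["nprocs"] = value
    return info


def clean_env(environ, wrapper_dir):
    """Copy the environment without tracking variables or the wrapper directory."""
    env = {k: v for k, v in environ.items() if not k.startswith("ALTD_")}
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p and p != wrapper_dir]
    env["PATH"] = os.pathsep.join(parts)
    return env


def reroute(argv, record, environ=None, wrapper_dir="/opt/altd/alias"):
    """Record the command, then run it unchanged in the cleaned environment.

    A failure in the tracking step is logged and never blocks the user's command.
    """
    environ = dict(os.environ if environ is None else environ)
    try:
        record(analyze(argv))
    except Exception:
        log.exception("tracking failed for %s", argv[0])
    return subprocess.call(argv, env=clean_env(environ, wrapper_dir))


if __name__ == "__main__":
    # Example: track and run a trivial command; the "database" is just print().
    sys.exit(reroute([sys.executable, "-c", "print('hello')"], record=print))
```

The design point carried over from the slide is that the user's command always runs, with its own exit status, whether or not the tracking step succeeds.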

Slide 23
TACC_Stats
  • Job-level transparent performance monitoring from HPC compute nodes
      - CPU performance counters
      - IB statistics
      - Lustre statistics
      - Scheduler job statistics
      - Host data
      - OS statistics
  • Analyses integrate available Lariat data (XALT in the future)

Slide 24
XALT: Understanding the Software Needs of High End Computer Users
  • Newly NSF funded project
  • Is combining the best of Lariat and ALTD
  • Collecting job-level and link-time level data and subsequent analytics
  • Building a community around analytics, potentially one of many tools
  • Will make it available to the community
  • Optional interface to XDMod/SUPReMM

Slide 25
XALT Goals
  • Goal is a census of libraries and applications and automatic filtering of user issues
      - What additional user problems can we detect and report (perhaps correct) automatically?
      - How can we leverage lessons learned by the tacc_stats team to implement additional automatic filtering?
  • Plan to add tracking of function calls as well
  • Want to balance the need for portability with support for site-specific capabilities
  • Want to simplify the processes system administrators use to install, configure, and manage

Slide 26
XALT Agenda
  • New tracking infrastructure: XALT
  • Alpha version available today
  • Deployed at NICS and TACC
  • LANL and CSCS testing it
  • Some new functionality still to add
      - Detect function calls
      - Check runtime environment versus compile-time environment
  • [email protected]
  • Want feedback, hungry for ideas

Slide 27
Thanks to
  • Richard Gerber and Zhengji Zhao, NERSC
  • Tim Robinson, CSCS
  • Bill Barth, TACC
  • Bilel Hadri, KAUST
  • Julius Westerman, LANL

Slide 28
Contact Info
  • Mark R. Fahey: [email protected] [email protected]
  • Robert McLay: [email protected] [email protected]

Slide 29
Background Slides

Slide 30
Lariat User #1
Bill Barth
Director of HPC, TACC
Co-PI SUPReMM

Slide 31
TACC Stats
  • Job-level transparent performance monitoring from HPC compute nodes
      - CPU performance counters
      - IB statistics
      - Lustre statistics
      - Scheduler job statistics
      - Host data
      - OS statistics
  • Analyses integrate available Lariat data

Slide 32
Nightly Analyses
  • Automatically analyzes jobs nightly
  • Highlights jobs worth looking at
  • Tries to provide a one-stop view of a job for
      - Support staff
      - Sysadmins
      - And soon, users

Slide 33
Current Reports
  • High levels of imbalance
  • Low Flops (but other activity)
  • Idle hosts
  • Catastrophic performance drop

Slide 34

Slide 35

Slide 36
Richard Gerber, Zhengji Zhao
NERSC User Services
NERSC Job Data

Slide 37
What Does NERSC Collect?
  • ALTD: track library usage both at compile and run time
  • Torque logs: job information, accounting
  • ALPS logs: track application run-time data and options on the Cray systems
  • Darshan: IO profiling data
  • IPM: MPI profiling data
  • Performance monitoring: monitoring system performance over the lifetime of the machines
  • LMT: Lustre data

Slide 38
Expose job data via the web
  • We try to make as much data available as possible via the web
      - For users to track usage
      - For users to check resource utilization
      - For users to monitor performance
      - For staff to help debug jobs
      - For summary reports
  • The following are web screenshots
  • All data collection is transparent to users

Slide 39

Slide 40

Slide 41

Slide 42

Slide 43

Slide 44

Slide 45

Slide 46
ALTD is enabled on all major computing platforms at NERSC

Slide 47
Applications of ALTD
  • Understanding current library usage and planning for future software needs
  • Providing usage statistics to developers and vendors
  • Restoring the program environment where user applications were built
  • Assisting with debugging system issues
An ALTD tool to restore the build environment for an application:

    aryal@edison12:~> linkinfo.sh /global/homes/a/aryal/bin/gvasp5.3.2
    User           : zz217
    Linked on      : 2013-01-03
    Executable Name: vasp
    Libraries Used :
    //usr/lib64/libhugetlbfs.a
    ../vasp.5.lib/libdmy.a
    /opt/cray/atp/1.6.0/lib//libAtpSigHCommData.a
    /opt/cray/atp/1.6.0/lib//libAtpSigHandler.a
    /opt/cray/libsci/12.0.00/cray/81/sandybridge/lib/libsci_cray_mp.a
    /opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a
    /opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpich_cray.a
    /opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpl.a
    /opt/cray/xpmem/0.1-2.0500.36799.3.6.ari/lib64/libxpmem.a
    /opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
    /opt/cray/ugni/4.0-1.0500.5836.7.58.ari/lib64/libugni.a
    /opt/cray/udreg/2.3.2-1.0500.5931.3.1.ari/lib64/libudreg.a
    /opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpslli.a
    /opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpsutil.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libpgas-dmapp.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
    /opt/cray/dmapp/4.0.1-1.0500.5932.6.5.ari/lib64/libdmapp.a
    /opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libfi.a
    /opt/gcc/4.4.4/snos/lib64/libstdc++.a
    /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libf.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libcraymath.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libcraymp.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libcsup.a
    //usr/lib64/librt.a
    /opt/cray/cce/8.1.2/craylibs/x86-64/libtcmalloc_minimal.a
    //usr/lib64/libpthread.a
    //usr/lib64/libc.a
    /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
    //usr/lib64/libm.a
    /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc.a

Slide 48
Monitoring Software Usage at CSCS
Dr Tim Robinson
CSCS
Drilling Down: Understanding User-Level Activity on Today's Supercomputers

Slide 49
We support many, many libs, tools, apps, compilers

Slide 50
How can we tell if software is actually being used?
  • Support staff want to answer questions like "is anyone using a legacy version of a certain library or application?"
  • Research teams, software developers, vendors, and funding agencies need to know: is their software being used? Where should they place effort for future development?
  • User surveys? Counting module loads?
  • A better solution: the Automatic Library Tracking Database (Fahey, Jones, and Hadri, Cray User Group meeting, 2010)

Slide 51
The Automatic Library Tracking Database
  • ALTD records information every time an application is linked and every time the resulting executable is launched on the compute nodes
  • This is done by intercepting the GNU linker and the job launcher
  • ALTD records the entire link line so it can be used to determine ancillary information about the compilation, such as which compiler suite was used to build the application
  • Extremely lightweight: essentially no overhead at compile/launch time
  • Only tracks libraries that are actually used in the application, not all libraries that appear on a user's link line

Slide 52
ALTD at CSCS
  • In production at CSCS since 2011
  • Rock solid: just a single downtime in two years
  • Rosa (Cray XE6) since March 2011: 600K compilations, 2.8M jobs
  • Todi (Cray XK6/XK7) since October 2012: 470K compilations, 500K jobs
  • Daint (Cray XC30) since March 2013: 100K compilations, 550K jobs
  • We've added an additional SQL table, "accounting", which logs more data about the application execution: number of cores used, number of cores claimed, number of threads, MPI processes, processes per node, ...
  • We want to be able to detect situations like the use of a buggy or non-performant library

Slide 53
How we mine data: a hypothetical situation
A critical bug has been identified in FFTW version 3.3.0.2, affecting code correctness

Slide 54
First, find which users have linked this library

    mysql> select distinct username
           from altd_rosa_link_tags, altd_rosa_linkline
           where altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
             and exit_code = 0
             and linkline like '%fftw/3.3.0.2/%';
    +----------+
    | username |
    +----------+
    | tkachenn |
    | boswald  |
    | liang    |
    | robinson |
    | yunding  |
    | zilia    |
    +----------+
    5 rows in set (4.33 sec)

Querying the ALTD database reveals that several users have applications linked to the buggy library

Slide 55
Now, check if they are using the buggy application

    mysql> select altd_rosa_jobs.*
           from altd_rosa_link_tags, altd_rosa_linkline, altd_rosa_jobs
           where altd_rosa_jobs.tag_id = altd_rosa_link_tags.tag_id
             and altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
             and exit_code = 0
             and linkline like '%fftw/3.3.0.2/%'
             and altd_rosa_jobs.username = "robinson";
    +---------+--------+------------------------+----------+------------+--------+---------------+
    | run_inc | tag_id | executable             | username | run_date   | job_id | build_machine |
    +---------+--------+------------------------+----------+------------+--------+---------------+
    | 2410158 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    | 2410172 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    | 2410198 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    | 2410222 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
    +---------+--------+------------------------+----------+------------+--------+---------------+
    4 rows in set (0.65 sec)

  • And it's confirmed that user robinson is running the application linked to the buggy library
  • It's now up to the user services group to contact the user and recommend relinking their applications against the newer version of FFTW, which has fixed the bug

Slide 56
This methodology is clearly unmanageable!
Ideally, user support specialists would be alerted automatically to situations of interest
  • Users running applications linked to legacy, less-performant, or buggy libraries
  • Users running legacy versions of applications
  • Users building code with legacy compilers
  • Users making use of their own libs or apps, when more optimized versions are available centrally
How can we automate the processes of data mining, reporting and alerting?
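
One possible starting point for the automation asked about on Slide 56 is a watch list of link-line patterns that is applied to job records on a schedule. The sketch below is only an illustration: the watch-list entries, the /users and /home prefixes for home directories, the function names, and the sample records are all assumptions, and a real deployment would read (username, executable, linkline) rows from the ALTD database instead of a Python list.

```python
import re
from collections import defaultdict

# Hypothetical watch list: regular expression on the link line -> reason to alert.
WATCH_LIST = [
    (re.compile(r"/fftw/3\.3\.0\.2/"), "buggy FFTW 3.3.0.2"),
    (re.compile(r"(^|\s)/(users|home)/\S+\.(a|so)\b"), "library built in a home directory"),
]


def build_alerts(runs, watch_list=WATCH_LIST):
    """Group matching (executable, reason) pairs by user.

    runs is an iterable of (username, executable, linkline) tuples.
    """
    alerts = defaultdict(set)
    for username, executable, linkline in runs:
        for pattern, reason in watch_list:
            if pattern.search(linkline):
                alerts[username].add((executable, reason))
    return {user: sorted(items) for user, items in alerts.items()}


def format_report(alerts):
    """Render one line per alert, sorted by user, for support staff."""
    lines = []
    for user in sorted(alerts):
        for executable, reason in alerts[user]:
            lines.append(f"{user}: {executable} ({reason})")
    return "\n".join(lines)


if __name__ == "__main__":
    sample_runs = [  # made-up records for illustration only
        ("alice", "/users/alice/mycode", "ld -o mycode main.o /opt/fftw/3.3.0.2/lib/libfftw3.a"),
        ("bob", "/users/bob/sim", "ld -o sim sim.o /users/bob/lib/libmine.a"),
        ("carol", "/users/carol/ok", "ld -o ok ok.o /opt/fftw/3.3.0.4/lib/libfftw3.a"),
    ]
    print(format_report(build_alerts(sample_runs)))
```

Keeping the rules in a plain list means support staff could add a new flagged library without touching the matching code; how the report is delivered (mail, ticket, web page) is left open here.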

Slide 57
Drilling Down: Understanding User-Level Activity on Today's Supercomputers
Mark R. Fahey
SC BoF
November 20, 2013

Slide 58
XALT: Understanding the Software Needs of High End Computer Users
  • Newly NSF funded project
  • Will be combining the best of Lariat and ALTD
  • Collecting job-level and link-time level data and subsequent analytics
  • Building a community around analytics, potentially one of many tools
  • Will make it available to the community
  • Optional interface to XDMod/SUPReMM

Slide 59
XALT Goals
  • Goal is a census of libraries and applications and automatic filtering of user issues
      - What additional user problems can we detect and report (perhaps correct) automatically?
      - How can we leverage lessons learned by the tacc_stats team to implement additional automatic filtering?
  • Plan to add tracking of function calls as well
  • Want to balance the need for portability with support for site-specific capabilities
  • Want to simplify the processes system administrators use to install, configure, and manage

Slide 60
Related Tool
  • tacc_stats is a related component in a collection of related initiatives/tools
      - Measures performance at the level of the individual job, for every job
      - Also has an automated process of identifying issues that deserve attention/resources
  • These tools can work together

Slide 61
Mark's not-so-hidden agenda
  • Do you know what libraries are being used?
  • Do you know how many users have trouble with runtime environment matching compile time?
  • Would you have strong opposition to intercepting the linker "ld"?
  • Anyone willing to be a beta tester for our before-and-after study?
  • Do you have any issues with dropping dot files in user home directories?
  • Do you want to track library function calls?
  • Will be forming a mailing list; we want you to join
  • Want feedback, hungry for ideas

Slide 62
Contact Info
  • Mark R. Fahey: [email protected] [email protected]
  • Robert McLay: [email protected] [email protected]