CVMFS Post Mortem Doug Benjamin Duke University


Page 1: CVMFS Post Mortem

CVMFS Post Mortem
Doug Benjamin
Duke University

Page 2: CVMFS Post Mortem

What happened?

PoolFileCatalog.xml became corrupt.

The relevant section of the file is:

<File ID="6651E9BA-061E-DD11-8F27-00304879FC6E">
 <physical>
   <pfn filetype="ROOT_All" name="/cvmfs/a<File ID="F80FEF94-CAF8-E011-8FBD-003048F0E7A2">
 <physical>
   <pfn filetype="ROOT_All" name="/cvmfs/atlas-condb.cern.ch/repo/conditions/cond11/cond11_data.000012.gen.COND/cond11_data.000012.gen.COND._0002.pool.root"/>
 </physical>
 <logical>
   <lfn name="cond11_data.000012.gen.COND._0002.pool.root"/>
 </logical>
</File>

The first <pfn filetype="ROOT_All" name="/cvmfs/a ... entry is bogus: it is cut off in the middle of the path, and a second <File> element starts inside it.
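Damage of this kind, where one <File> element opens inside another, can be caught with a crude tag count before anything is published. The following is a minimal sketch under stated assumptions, not the check that was actually deployed: the function name pfc_tags_balanced is hypothetical, and comparing counts is only a heuristic, not a full XML parse.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original cron script).
# Succeeds only if the catalog has as many <File ...> openings as
# </File> closings. This is a heuristic, not an XML parser.
pfc_tags_balanced() {
    pfc="$1"
    [ -r "$pfc" ] || return 2

    # grep -o prints one match per line, so wc -l counts occurrences
    # even when two tags share a line, as in the corrupt catalog.
    opens=$(grep -o '<File ' "$pfc" | wc -l | tr -d ' ')
    closes=$(grep -o '</File>' "$pfc" | wc -l | tr -d ' ')

    [ "$opens" -eq "$closes" ]
}
```

A publishing script could call pfc_tags_balanced PoolFileCatalog.xml and refuse to continue on a non-zero exit code.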

Page 3: CVMFS Post Mortem

What happened (2)

The lead cvmfs developer was cleaning the repository and triggered the publishing of the bogus file.
 He did not know it was bogus (there is no way he would have known).

Stratum 1 servers picked up the bogus file within 1 hour and published it.
 Cron jobs on the Stratum 1 servers fetch files from the Stratum 0 server hourly.
 Cvmfs clients fetch files from the Stratum 1 servers whenever either the time-to-live information expires or an automount of the cvmfs areas is triggered.

Page 4: CVMFS Post Mortem

How was the PFC created?

The PoolFileCatalog.xml is created by a cron script that runs this command in a loop:

    # loop over the directories
    for dir in $dir_list
    do
      # determine if there are any data sets
      ls -1 ${ATLAS_POOLCOND_PATH}/${dir}/* > /dev/null 2>&1
      if [ "$?" = "0" ]
      then
        echo "running command - dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir}" >> $LogFile 2>&1
        dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir} >> $PoolFileCatalogLog 2>&1
        retcode=RC$?
        if [ $retcode != "RC0" ] ; then
          echo "Error - failed to update PoolFileCatalog - exiting " >> $LogFile 2>&1
          echo "Error - failed to update PoolFileCatalog - exiting "
          exit 1
        fi
      else
        echo "${ATLAS_POOLCOND_PATH}/${dir} does not have datasets" >> $LogFile 2>&1
      fi
    done

where $dir_list is

    dir_list="oflcond cmccond comcond cond08 cond09 cond10 cond11 cond12 cond13 cond14 cond15 cond16 cond17 cond18 cond19 cond20"

and ATLAS_POOLCOND_PATH is

    export ATLAS_POOLCOND_PATH=/cvmfs/atlas-condb.cern.ch/repo/conditions

Page 5: CVMFS Post Mortem

What was the immediate fix?

The bogus lines were removed from the PoolFileCatalog.xml.

The cron job that does the file checkout and ultimate publishing was stopped and has not been restarted.

Page 6: CVMFS Post Mortem

Why did it happen?

It is not clear why the PoolFileCatalog creation failed.

The logs did not give any indication of the failure.

There was no backup PFC file.

Page 7: CVMFS Post Mortem

Remediation steps

Ultimately use Alessandro DeSalvo’s sw-mgr code to get the datasets and create the PFC (it saves the older version).
 This requires ATLAS software releases to be available on the conditions db machine.
 Steve Traylen is working on the cvmfs mounts; it is a bit tricky and troublesome.

Run the xml and file verification step from Misha Borodin in the cron job.
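What the file verification step does is not spelled out here. One plausible reading is that every pfn listed in the catalog must point to a file that actually exists. The sketch below illustrates that reading only; it is not Misha Borodin's code, and the function name pfc_missing_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (an assumption about what "file verification" means).
# Prints every pfn path from the catalog that does not exist on disk
# and returns non-zero if any path is missing.
pfc_missing_files() {
    pfc="$1"

    # Extract the name="..." value of each <pfn ...> element, one per line,
    # and keep only the paths that are not regular files.
    missing=$(sed -n 's/.*<pfn [^>]*name="\([^"]*\)".*/\1/p' "$pfc" |
        while IFS= read -r path; do
            [ -f "$path" ] || printf '%s\n' "$path"
        done)

    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}
```

A truncated entry such as the bogus one in the corrupt catalog would be reported too, since its name does not resolve to a real file. The sed pattern assumes one pfn element per line, as in the catalog excerpt.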

Page 8: CVMFS Post Mortem

Short term plans

Resume fetching of datasets to the machine.
 This will be done manually (with the same script, without the publishing step).

Run the PFC file creation separately.
 Add xml format verification.
 Add PFC file backup (keep a few copies).

Once everything looks good, publish manually.

Will update every day or so.
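The "keep a few copies" backup could look roughly like the sketch below. It is an illustration only: the function name pfc_backup, the backup directory argument and the default of five copies are all assumptions, not details from the actual setup.

```shell
#!/bin/sh
# Hypothetical helper: copy the current catalog to a timestamped backup
# and keep only the newest $keep backups.
pfc_backup() {
    pfc="$1"          # catalog to back up
    backup_dir="$2"   # assumed backup location
    keep="${3:-5}"    # number of copies to keep (assumed default)

    mkdir -p "$backup_dir" || return 1
    stamp=$(date +%Y%m%dT%H%M%S)
    cp -p "$pfc" "$backup_dir/PoolFileCatalog.xml.$stamp" || return 1

    # Timestamped names sort chronologically, so a reverse sort lists the
    # newest first; everything after the first $keep entries is deleted.
    ls -1 "$backup_dir"/PoolFileCatalog.xml.* | sort -r |
        tail -n +"$((keep + 1))" |
        while IFS= read -r old; do
            rm -f "$old"
        done
}
```

Calling pfc_backup PoolFileCatalog.xml /some/backup/dir 3 before each regeneration would leave the three most recent catalogs available for a rollback.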

Page 9: CVMFS Post Mortem

Intermediate plans

Once the ATLAS code is available:

Implement sw-mgr creation of the PFC and fetching of the datasets.
 Initially this will be done by hand.
 Ultimately it will be moved to a cron job.

Will add e-mail notification in case of failures.
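A failure notification can be wrapped around each step of the cron job. The following is a rough sketch, not the planned implementation: run_step, notify and NOTIFY_TO are hypothetical names, the recipient is a placeholder, and it assumes a working mail command on the machine.

```shell
#!/bin/sh
# Assumed recipient; replace with the real operator address.
NOTIFY_TO="${NOTIFY_TO:-root}"

# Default notifier: send subject ($1) and body ($2) with the system
# mail command. Redefine notify() to use a different channel.
notify() {
    printf '%s\n' "$2" | mail -s "$1" "$NOTIFY_TO"
}

# Run a named step; on failure, send a notification and pass the
# command's exit code back to the caller.
run_step() {
    step="$1"
    shift
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        notify "PFC update failed: $step" "Command '$*' exited with code $rc"
    fi
    return "$rc"
}
```

A cron script would then call, for example, run_step "create PFC" followed by the real command and its arguments, and stop if run_step returns non-zero.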