CVMFS Post Mortem
Doug Benjamin
Duke University
What happened?
PoolFileCatalog.xml became corrupt.
The relevant section of the file is:
<File ID="6651E9BA-061E-DD11-8F27-00304879FC6E">
<physical>
<pfn filetype="ROOT_All" name="/cvmfs/a<File ID="F80FEF94-CAF8-E011-8FBD-003048F0E7A2">
<physical>
<pfn filetype="ROOT_All" name="/cvmfs/atlas-condb.cern.ch/repo/conditions/cond11/cond11_data.000012.gen.COND/cond11_data.000012.gen.COND._0002.pool.root"/>
</physical>
<logical>
<lfn name="cond11_data.000012.gen.COND._0002.pool.root"/>
</logical>
</File>
The first <pfn filetype="ROOT_All" name="/cvmfs/a ... entry is bogus: it is cut off mid-path and runs straight into the next <File> element.
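One quick way to spot an entry like this is to list the pfn lines that do not close with "/>". The sketch below assumes each pfn element sits on a single line, as in the fragment above; the function name is hypothetical and is not part of the original cron script.

```shell
#!/bin/bash
# Hypothetical helper, not from the original script.
# Print (with line numbers) every <pfn ...> line that does not end with "/>".
# Assumes one pfn element per line. Exit status is 0 if any such line is found.
find_truncated_pfn() {
    grep -n '<pfn ' "$1" | grep -v '/>[[:space:]]*$'
}
```

Run against the corrupt catalog, this would print the line holding the truncated /cvmfs/a entry; on a clean catalog it prints nothing and returns a non-zero status.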
What happened (2)
The lead cvmfs developer was cleaning the repository and triggered the publishing of the bogus file.
  He did not know it was bogus (there is no way he would have known).
Stratum 1 servers picked up the bogus file within 1 hour and published it.
  Cron jobs on the Stratum 1 servers fetch files from the Stratum 0 server hourly.
Cvmfs clients fetch files from the Stratum 1 servers whenever either the time-to-live information expires or an automount of the cvmfs areas is triggered.
How was the PFC created?
The PoolFileCatalog.xml is created by a cron script that runs the command below in a loop, where $dir_list is
dir_list="oflcond cmccond comcond cond08 cond09 cond10 cond11 cond12 cond13 cond14 cond15 cond16 cond17 cond18 cond19 cond20"
and ATLAS_POOLCOND_PATH is
export ATLAS_POOLCOND_PATH=/cvmfs/atlas-condb.cern.ch/repo/conditions
# loop over the directories
for dir in $dir_list
do
    # determine if there are any data sets
    ls -1 ${ATLAS_POOLCOND_PATH}/${dir}/* > /dev/null 2>&1
    if [ "$?" = "0" ]
    then
        echo "running command - dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir}" >> $LogFile 2>&1
        dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir} >> $PoolFileCatalogLog 2>&1
        retcode=RC$?
        if [ $retcode != "RC0" ] ; then
            echo "Error - failed to update PoolFileCatalog - exiting " >> $LogFile 2>&1
            echo "Error - failed to update PoolFileCatalog - exiting "
            exit 1
        fi
    else
        echo "${ATLAS_POOLCOND_PATH}/${dir} does not have datasets" >> $LogFile 2>&1
    fi
done
What was the immediate fix?
The bogus lines were removed from the PoolFileCatalog.xml.
The cron job that does the file checkout and ultimate publishing was stopped and has not been restarted.
Why did it happen?
It is not clear why the PoolFileCatalog creation failed.
  The logs did not give any indication of the failure.
There was no backup PFC file.
Remediation steps
Ultimately use Alessandro DeSalvo’s sw-mgr code to get the datasets and create the PFC (saves older version).
  Requires ATLAS software releases to be available on the conditions db machine.
  Steve Traylen is working on the cvmfs mounts; it is a bit tricky and troublesome.
Run the xml and file verification step from Misha Borodin in the cron job.
Short term plans
Resume fetching of datasets to the machine.
  Will be done manually (with the same script, without the publishing step).
Will run PFC file creation separately.
  Add xml format verification.
  PFC file backup (keep a few copies).
Once everything looks good, publish manually.
Will update every day or so.
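The verification and backup steps in this plan could be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the check simply compares the number of <File ...> openings with the number of </File> closings (a cheap structural test, not full XML validation), and three backup copies is an arbitrary choice for "a few". Where libxml2 is installed, xmllint --noout would be a stricter well-formedness check.

```shell
#!/bin/bash
# Hypothetical helpers; none of these names come from the original cron script.

# Cheap sanity check: the file is non-empty and every <File ...> has a </File>.
pfc_looks_sane() {
    local pfc="$1" opens closes
    [ -s "$pfc" ] || return 1
    opens=$(grep -o '<File ' "$pfc" | wc -l | tr -d ' ')
    closes=$(grep -o '</File>' "$pfc" | wc -l | tr -d ' ')
    [ "$opens" -gt 0 ] && [ "$opens" -eq "$closes" ]
}

# Keep numbered backups: file.1 is the newest, file.$keep the oldest.
rotate_backups() {
    local file="$1" keep="${2:-3}" i
    [ -f "$file" ] || return 0
    i=$((keep - 1))
    while [ "$i" -ge 1 ]; do
        [ -f "$file.$i" ] && mv -f "$file.$i" "$file.$((i + 1))"
        i=$((i - 1))
    done
    cp -p "$file" "$file.1"
}

# Replace the live catalog only if the newly built candidate passes the check.
install_pfc() {
    local candidate="$1" target="$2"
    if ! pfc_looks_sane "$candidate"; then
        echo "Error - $candidate failed verification; keeping existing $target" >&2
        return 1
    fi
    rotate_backups "$target" 3
    mv -f "$candidate" "$target"
}
```

The idea is to build the new catalog under a temporary name and call install_pfc on it, so a truncated or half-written file never replaces the last good copy and a previous version is always at hand.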
Intermediate plans
Once ATLAS code is available, implement sw-mgr creation of the PFC and fetch of the datasets.
  Initially will be done by hand.
  Ultimately moved to a cron job.
Will add e-mail notification in case of failures.
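A minimal sketch of what the failure notification might look like once the steps run from cron. It assumes a mailx-style mail command is available on the machine; the address and the function names are placeholders and do not come from the slides.

```shell
#!/bin/bash
# Hypothetical address and helper names; neither appears in the original slides.
PFC_NOTIFY_ADDR="${PFC_NOTIFY_ADDR:-pfc-admin@example.org}"

# Send a short failure report. Assumes a mailx-style "mail" command is installed.
notify_failure() {
    local step="$1" rc="$2" log="$3"
    {
        echo "Step '$step' failed with exit code $rc on $(hostname)."
        echo "Last lines of $log:"
        tail -n 20 "$log" 2>/dev/null
    } | mail -s "PFC update failed: $step" "$PFC_NOTIFY_ADDR"
}

# Run one step, append its output to the log, and notify on a non-zero exit code.
run_step() {
    local step="$1" log="$2" rc
    shift 2
    "$@" >> "$log" 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "Error - $step failed (rc=$rc)" >> "$log"
        notify_failure "$step" "$rc" "$log"
        return "$rc"
    fi
}
```

Each stage of the cron job (dataset fetch, PFC creation, verification) could then be wrapped as run_step "name" "$LogFile" command args, so that a failure is both logged and mailed instead of passing silently.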