How can I configure PBS so that afterok will wait for the stageout operation of the parent job to complete before staging in the child job?

Hi everyone,

I have a series of jobs that have to run in a specific order. Each job needs to be able to read the results generated by the previous job in order to start properly.

My cluster is configured so that each compute node has an SSD dedicated to hosting files for the jobs currently running on that node, so I’m using stagein and stageout to copy files to, and retrieve files from, the compute node that PBS assigns to the job.

The problem I’m running into is that when I use:

#PBS -W depend=afterok:[parent job number]

the child job will start its stagein operation before the parent job has completed its stageout operation, and when the child job tries to execute, it will be missing files and fail to run.

I have been looking through the various manuals for a configuration option that makes jobs using afterok wait for the parent job’s stageout to complete, but I haven’t found one so far.

How can I configure PBS, or rewrite my submission script, so that the child jobs will have the files they need?



P.S. Here is a sample PBS script for one of my child jobs:

#PBS -N apply_current
#PBS -j oe
#PBS -o apply_current.out
#PBS -W sandbox=PRIVATE
#PBS -l select=1:ncpus=24:mpiprocs=24
#PBS -l abaqus_tokens=19
#PBS -l abaqus_count=19
#PBS -l walltime=10:00:00
#PBS -W stagein=".@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current/uamp.o,.@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current/apply_current.inp,.@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current/"
#PBS -W stageout="apply_current.abq@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.dat@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.mdl@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.msg@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.odb@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.pac@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.prt@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.res@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.sel@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.size@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.sta@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.stt@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.sim@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current,apply_current.use@rice:/home/mgesing/Documents/sandbox/tribometer/step_2-apply_current"
#PBS -W depend=afterok:5671
abaqus python

You can use a runjob hook or an execjob_begin hook to check that all the data required to run the job exists (or to copy it from another location). If the data is missing, reject the job, which puts it back in the queue. That way the dependent job will not start running while its data is unavailable.
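
As a rough illustration of the hook idea, here is a minimal sketch of a runjob hook body, not a tested production hook. It assumes the host that runs the hook can see the directory the parent job stages out to, and the paths in REQUIRED_FILES are hypothetical placeholders that you would replace with the parent job's real output files. The job name "apply_current" is taken from the sample script above. The pbs module only exists inside the PBS hook interpreter, so the import is guarded to let the file-checking helper be tried out on its own.

```python
import os

# Hypothetical placeholders: replace with the files the parent job stages out.
REQUIRED_FILES = [
    "/path/to/parent/results/parent_job.odb",
    "/path/to/parent/results/parent_job.res",
]


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


try:
    import pbs  # provided by PBS only when this runs as a hook
except ImportError:
    pbs = None  # running outside PBS, e.g. while testing the helper

if pbs is not None:
    event = pbs.event()
    job = event.job
    # Only check the child job; every other job is let through untouched.
    if str(job.Job_Name) == "apply_current":
        missing = missing_files(REQUIRED_FILES)
        if missing:
            # reject() ends the hook; the job is not run on this attempt.
            event.reject("input files not staged out yet: " + ", ".join(missing))
    event.accept()
```

If this is saved as check_inputs.py (a name chosen here only for illustration), an administrator could register it with something along the lines of qmgr -c "create hook check_inputs", qmgr -c "set hook check_inputs event = runjob" and qmgr -c "import hook check_inputs application/x-python default check_inputs.py". A file's existence does not prove the copy has finished, so you may want the parent to write a small marker file last and have the hook check for that instead.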