Checkpointing function for scheduler

sxy · August 3, 2016, 12:35am

Hi,

Do you have any plans to extend PBS Pro so that checkpointed jobs can run on any free compute node?

thanks,

Sue

mkaro · August 3, 2016, 3:52pm

Hello Sue,

There are currently no plans to migrate checkpointed jobs to alternate nodes. Migrating multinode jobs is tricky business, whereas single node jobs are more straightforward. To which are you referring? Are you using BLCR? MPI?

If the community believes this would provide sufficient value (versus the cost of the work) we could certainly consider it.

Thanks,

Mike

arungrover · August 4, 2016, 5:42am

Hi Sue,

What you are asking for is, in a way, already supported in PBS.

An admin can configure PBS to use their own checkpoint scripts (which can also call third-party checkpointing tools like Meiosys Checkpoint and BLCR, as pointed out by @mkaro) and set the checkpointing type to “checkpoint_abort”. This causes the job to be requeued, and the PBS scheduler may eventually run the checkpointed job on a different node in a subsequent scheduling cycle.

For more information, I’d recommend reading section 9.3.2 of the PBSPro Admin Guide.

Hope it helps!

Regards,
Arun Grover

sxy · August 11, 2016, 6:03am

Thanks.

Do you have any generic checkpoint scripts that we can have a look at?
Or what would you suggest regarding the use of checkpoint scripts?
Do we have to use, for example, BLCR to checkpoint jobs?

regards,

Sue

scott · August 11, 2016, 4:42pm

I have an example from a few years ago. I would expect it to still work because, to my knowledge, the PBS Professional interfaces for CPR have not changed. However, I do not see a way to share scripts on the forum.

I decided to publish them on my GitHub: https://github.com/scottaltair/PBS-Professional-CPR-Example

arungrover · August 12, 2016, 9:22am

Hi Sue,

I’m sorry, I think I made a mistake in answering this query. One of my colleagues pointed out that even after a requeue, the checkpointed job can only be resumed on the same node where it was checkpointed.

So for now, I’m not sure we can resume a checkpointed job on a different node.

Regards,
Arun

sxy · May 16, 2017, 2:39am

Hi,

I have tested the checkpoint function with the code that Scott provided.

Two different MPI jobs run on one compute node: one submitted through a higher-priority queue and the other through a lower-priority queue. The checkpoint didn’t work, and I received these messages:

05/11/2017 17:14:54;0004;pbs_mom;Job;40.headnode;action checkpoint_abort script /var/spool/pbspro/mom_priv/checkpoint_abort.sh cannot be executed due to permissions
05/11/2017 17:14:54;0008;pbs_mom;Job;40.headnode;checkpoint failed: errno=0
05/11/2017 17:14:54;0008;pbs_mom;Job;40.headnode;req_holdjob: Checkpoint initiated.
05/11/2017 17:14:56;0080;pbs_mom;Req;req_reject;Reject reply code=15061, aux=2, type=7, from root@192.168.129.10:16001

Code 15061 is described as “not all tasks could checkpoint” in the reference manual.
What does it mean? If MPI jobs can’t be checkpointed, PBS Pro checkpointing is almost useless on clusters.

thanks,

Sue

sgombosi · May 16, 2017, 3:22am

Judging from the MOM log messages, you have a file permission problem on the checkpoint_abort.sh action script. Is the execute bit set on this file?
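To read messages like these for a single job, the semicolon-separated fields in the log lines above can be filtered by job ID. This is only a sketch: `mom_log_for_job` is a hypothetical helper name, and the field positions are inferred from the four lines quoted in this thread rather than from a format specification.

```shell
# Sketch: print timestamp and message for one job from a MoM log file.
# Field layout assumed from the lines quoted above:
#   1=timestamp 2=event code 3=daemon 4=object type 5=object id 6..=message
mom_log_for_job() {
    job_id=$1     # e.g. 40.headnode
    log_file=$2   # path to a MoM log file
    # Keep lines whose 5th field is the job id, then drop the middle fields.
    awk -F';' -v job="$job_id" '$5 == job' "$log_file" | cut -d';' -f1,6-
}
```

For the excerpt above, `mom_log_for_job 40.headnode <logfile>` would keep the three lines tagged with the job and skip the `req_reject` line, whose fifth field is not a job ID.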

Steve

sxy · May 16, 2017, 3:25am

it is executable:

-rwxr-xr-x 1 sxy admin 853 Aug 11 2016 checkpoint_abort.sh

Sue

sgombosi · May 16, 2017, 3:34am

It needs to be owned by root. It’s a security violation to have root (i.e. the MoM process) running scripts that are owned or writable by non-root users. See page 394 of the 14.2.1 Admin Guide:

“• Under Linux, the checkpoint script should be owned by root, and writable by root only, with permission 0755.”
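A quick way to check a script against that requirement is sketched below. `check_action_script` is a hypothetical helper name, and the sketch assumes GNU `stat` (as found on Linux), so treat it as an illustration rather than a PBS tool.

```shell
# Sketch: report whether an action script is owned by root with mode 755.
# Assumes GNU stat; the function name is illustrative only.
check_action_script() {
    script=$1
    owner=$(stat -c '%U' "$script") || return 2
    mode=$(stat -c '%a' "$script") || return 2
    status=0
    if [ "$owner" != "root" ]; then
        echo "$script: owned by $owner, expected root"
        status=1
    fi
    if [ "$mode" != "755" ]; then
        echo "$script: mode $mode, expected 755"
        status=1
    fi
    return $status
}
```

For the file listed earlier in the thread (owner sxy, mode 755), this would flag only the owner. Running `chown root` and `chmod 0755` on the script as root should then satisfy the quoted rule.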

By the way, if you’re using Scott’s sample checkpoint script from Github, there may be a typo in the kill command. It should read:

kill -TSTP …

not kill -SIGTSTP …

Steve

sxy · May 16, 2017, 5:30am

Hi Steve,

Below is the checkpoint_abort.sh I am using. I can’t find the file ${PBS_JOBID}_data.chk anywhere. What does this file contain?

thanks, Sue

#!/bin/sh -x
#
# © Copyright 2012 Altair Engineering, Inc. All rights reserved.
# This code is provided "as is" without any warranty, express or implied, or
# indemnification of any kind. All other terms and conditions are as
# specified in the Altair PBS EULA.
#
# Assumption:
#
# Purpose:
#
exec >/tmp/checkpoint_abort.debug 2>&1

CHECKPOINTPATH=$1
if [ ! -d ${CHECKPOINTPATH} ]; then
    mkdir -p ${CHECKPOINTPATH} || exit 1
fi

# Source in PBS specific environment variables from pbs.conf
PBS_CONF=${PBS_CONF:-/etc/pbs.conf}
[ -f ${PBS_CONF} ] && . ${PBS_CONF}

JOB_JB=${PBS_HOME}/mom_priv/jobs/${PBS_JOBID}.JB
JOB_SC=${PBS_HOME}/mom_priv/jobs/${PBS_JOBID}.SC
PIDS=`ps --sid ${PBS_SID} -o pid=`

cp ${JOB_SC} ${CHECKPOINTPATH}/${PBS_JOBID}.SC
#kill -SIGTSTP ${PIDS}
kill -TSTP ${PIDS}
sleep 1
cp ${PBS_JOBDIR}/${PBS_JOBID}_data.chk ${CHECKPOINTPATH}
kill -15 ${PIDS}

sxy · May 16, 2017, 6:17am

Hi Steve,

Attached is the checkpoint_abort.sh I am using, from the GitHub repository maintained by Scott. I can’t find ${PBS_JOBID}_data.chk anywhere on the system.
What is this file for?

Thanks,

Sue

scott · May 16, 2017, 4:46pm

Sue, please confirm the steps below, which are part of the README.

Steps to demo:
As root

Update the $PBS_HOME/mom_priv/config with the contents of mom_priv/config.example minding the PATHs to the scripts

Copy the checkpoint_abort.sh, checkpoint.sh, and restart.sh into mom_priv

Chmod 755 the new scripts

Restart PBS MOM

As user

qsub checkpointable_app.sh

tail -f ${PBS_JOBDIR}/${PBS_JOBID}.OU

PBS_JOBDIR can be determined by qstat -f | grep jobdir

In another terminal window, as root or the user.

qhold $PBS_JOBID

watch the output of the tail in the user’s window

qrls $PBS_JOBID

again watch the output of the tail

Allow the job to run, and you will see the periodic checkpoint kick in, too.
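For the `tail` step above, the job directory can be pulled out of the `qstat -f` output with a small filter. This is a sketch only: `jobdir_from_qstat` is a hypothetical helper, and it assumes the attribute is printed on a single `jobdir = <path>` line that is not wrapped.

```shell
# Sketch: extract the jobdir value from `qstat -f` output read on stdin.
# Assumes a line of the form "    jobdir = /some/path".
jobdir_from_qstat() {
    sed -n 's/^[[:space:]]*jobdir = //p'
}
```

Usage would look like `PBS_JOBDIR=$(qstat -f "$PBS_JOBID" | jobdir_from_qstat)`, followed by the `tail -f` command from the steps.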

I would like to see the contents of the $PBS_HOME/mom_priv/config file on the system you have deployed the cpr demo scripts on.

cat $PBS_HOME/mom_priv/config

Also, please note that the example “checkpointable” job script (checkpointable_app.sh) is what writes out the file.

Here is the snippet from pbs_cpr_demo/checkpointable_app.sh:

write_restart_file() {
    echo `date` Writing restart file…
    echo $number > ${PBS_JOBID}_data.chk
}

Scott

sgombosi · May 16, 2017, 4:49pm

A little background:

The important thing to remember here is that true system-level checkpointing (like that found in a lot of the old vendor-supported Unix systems like Unicos, HP-UX, or AIX) doesn’t exist in Linux. There’s no “checkpoint” system call that can be used to checkpoint a generic application.

What this means is that your application has to have some method of checkpointing itself, usually in response to a signal. Scott’s checkpoint_abort script is a sample of how one might write an action script to trigger such a self-checkpoint in an application that uses SIGTSTP to trigger a checkpoint. On Github, there’s a companion script that is intended to run as a demonstration “application” - it basically sleeps and increments a counter in a loop. It traps SIGTSTP and generates a “checkpoint file”. The checkpoint_abort script is designed to work in conjunction with that “application”. It’s not a “plug-and-play” solution to checkpointing any generic application.
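The self-checkpoint pattern described above can be sketched as a toy shell "application" that saves its state when it receives SIGTSTP and resumes from that state on the next start. This is not Scott's demo script: `CHK_FILE`, `read_restart_file`, and `run_counter` are illustrative names, and the only state is a counter.

```shell
#!/bin/sh
# Sketch of an application that checkpoints itself on SIGTSTP.
# All names except PBS_JOBID are hypothetical and not part of PBS.
CHK_FILE=${CHK_FILE:-${PBS_JOBID:-demo}_data.chk}
number=0

write_restart_file() {
    # Save the only state this toy application has: the counter.
    echo "$number" > "$CHK_FILE"
}

read_restart_file() {
    # Resume from a previous checkpoint if one exists.
    if [ -f "$CHK_FILE" ]; then
        number=$(cat "$CHK_FILE")
    fi
}

run_counter() {
    limit=$1
    read_restart_file
    # The shell runs the handler once the current sleep returns.
    trap write_restart_file TSTP
    while [ "$number" -lt "$limit" ]; do
        number=$((number + 1))
        sleep 1
    done
}
```

Sending `kill -TSTP` to the shell running `run_counter` writes the checkpoint file, and a later `run_counter` call picks up from the saved value. A real application would need its own equivalent of these two functions, which is why an action script alone cannot checkpoint arbitrary programs.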

Steve

sxy · May 19, 2017, 1:38am

So the checkpoint function in PBS Pro doesn’t do more than what Maui does.

Sue

sxy · May 31, 2017, 5:02am

Generic checkpointing for preemption, as distinct from application self-checkpointing, would be very useful.
Nowadays, as big data science emerges, more and more simulations require large amounts of memory.
Generic job checkpointing would make preemption a more reasonable option than job suspension.
Furthermore, if generic job checkpointing were made available, running checkpointed jobs on different cores/nodes would be the next step in development, and it should be one of the fundamental features of PBS.

Sue
