I was testing checkpoint/restart on 18.1.2, and MoM was segfaulting during job deletion while the job was already in the E state.
I ran into several other issues too. Going through the GitHub issues, some of them seemed to have fixes that had already been merged into the master branch.
So I decided to test the master branch. I built the packages for CentOS 7.5.1804 and tested checkpoint/restart again. MoM no longer segfaults, but after a checkpoint/restart the job cannot be deleted and seems to stay registered on the server forever, even after its processes on the execution node have been cleaned up. One way to delete the job is to stop MoM, run qdel -W force, and start MoM again. Another way that sometimes works is:
while : ; do qdel JOBID ; done
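A bounded version of that loop might look like the sketch below, so it stops once the server has dropped the job instead of spinning forever. It assumes qstat exits non-zero when the server no longer knows the job (this may not hold if job history is enabled); qdel_retry and QDEL_RETRY_DELAY are names made up for this illustration and are not part of PBS.

```shell
#!/bin/sh
# Sketch only: retry qdel until qstat stops reporting the job,
# giving up after a maximum number of attempts.
# Usage: qdel_retry JOBID [MAX_ATTEMPTS]
qdel_retry() {
    jobid=$1
    max=${2:-30}
    i=0
    while [ "$i" -lt "$max" ]; do
        # Assumption: qstat returns non-zero once the job is unknown to the server.
        if ! qstat "$jobid" >/dev/null 2>&1; then
            echo "job $jobid gone after $i qdel attempts"
            return 0
        fi
        qdel "$jobid" >/dev/null 2>&1
        i=$((i + 1))
        # QDEL_RETRY_DELAY is a hypothetical knob: seconds to wait between attempts.
        sleep "${QDEL_RETRY_DELAY:-1}"
    done
    echo "job $jobid still present after $max attempts" >&2
    return 1
}
```

For example, qdel_retry JOBID 60 would try for about a minute with the default one-second delay and return non-zero if the job is still listed.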
I used the following repo to test checkpointing: https://github.com/scottaltair/PBS-Professional-CPR-Example.git
The only change I made was in the checkpoint(_abort).sh script, where I changed “kill -SIGTSTP” to “kill -TSTP”.
Can someone test this and confirm? Could it be that I have found another bug?