Cannot delete Job after Checkpoint/Restart

Hi,

I was testing checkpoint/restart on 18.1.2, and MoM segfaulted during job deletion while the job was already in the E state.
I ran into several other issues too. Going through the GitHub issues, some of them seem to have fixes that have already been merged into the master branch:


https://github.com/PBSPro/pbspro/pull/694
https://github.com/PBSPro/pbspro/pull/732

Therefore I decided to test the master branch. I built the packages for CentOS 7.5.1804 and tested checkpoint/restart again. MoM does not segfault anymore, but after checkpointing and restarting, the job cannot be deleted and seems to stay registered on the server forever, even after the processes on the execution node have been cleaned up. One way to delete the job is to stop MoM, run qdel -W force, and start MoM again. Another way that sometimes works is:

while : ; do qdel JOBID ; done
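
The loop above never terminates on its own, even once the job is gone. A bounded variant, as a minimal sketch: the function name retry_qdel and its defaults (30 attempts, 2 seconds apart) are made up for illustration, and it assumes that qstat JOBID exits with a non-zero status once the server no longer knows the job.

```shell
# retry_qdel JOBID [MAX_TRIES] [DELAY_SECONDS]
# Hypothetical helper: repeats qdel until qstat no longer reports the job,
# or gives up after MAX_TRIES attempts.
retry_qdel() {
    jobid=$1
    max_tries=${2:-30}   # arbitrary default, adjust as needed
    delay=${3:-2}        # seconds to wait between attempts
    i=1
    while [ "$i" -le "$max_tries" ]; do
        # Assumption: qstat fails once the server has dropped the job.
        if ! qstat "$jobid" >/dev/null 2>&1; then
            echo "job $jobid is gone after $((i - 1)) qdel attempt(s)"
            return 0
        fi
        qdel "$jobid" >/dev/null 2>&1
        sleep "$delay"
        i=$((i + 1))
    done
    echo "job $jobid still present after $max_tries attempts" >&2
    return 1
}
```

Calling retry_qdel JOBID then returns 0 as soon as the job has disappeared and 1 if it is still there after the last attempt, so the result can be checked in a script.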

I used the following repo to test checkpointing: https://github.com/scottaltair/PBS-Professional-CPR-Example.git
The only change I made was in the checkpoint(_abort).sh scripts, where I changed “kill -SIGTSTP” to “kill -TSTP”.

Can someone test this and confirm? Could it be that I have found another bug?

Thanks

Hi @mae,

Could you please pull the changes from here and let us know whether they work for you? If they do, we will merge the fix for this issue after code review.

If not, we will need to create a new ticket.

Thanks,
Prakash

Hi @prakashcv13,

I just pulled those two commits and built the packages. They seem to have fixed the behavior I saw last week.
At least I can’t reproduce it anymore.

Thanks,
MaE

Good to know, @mae.