Option to resume multinode jobs

I’ve written a design document proposing changes that would enable resuming multi-node jobs. I believe this could be useful when you need to restart PBS processes in your infrastructure (for example, when a PBS update comes out) and don’t want to start jobs from scratch or limit your users to single-node jobs in the period leading up to the planned restart.

I’ve already programmed a (hopefully) working solution, but I’d be glad for any feedback that would make it better!


For some reason I can’t edit the original post, so here’s the link to the proposal: https://openpbs.atlassian.net/wiki/spaces/PD/pages/2041839617/Add+option+to+resume+multinode+jobs

I think the following entry in the design can be removed now:
“When saving tasks to disk, save obits (so when mom restarts it knows it should send an obit to MS)”

Thanks for the suggestion, it’s removed now.
