PP-479: Running subjobs to be able to survive a pbs_server restart

Absolutely @billnitzberg. The thought is to discuss the large enhancements, but restrict the current implementation to only the “survive restarts” portion. We can do this either by keeping this discussion as is and noting that we are only implementing a subset of the interfaces right now, with the others open for discussion, or by reverting the title to the earlier one and creating more discussion items later.

Additionally, maybe we need to find a better way to discuss broader topics and then be able to dive down into specifics for each release. @Shrini-h, maybe the best way forward for now is to revert to the earlier “smaller” scope, and then tackle the other interfaces in separate RFEs later. In the meantime, we will prod the maintainers to consider how to effectively discuss larger topics that are too risky to cover in one implementation step. Thanks.

@Shrini-h Concerning the work coordination: as far as I can see, the discussion here is still open. I just want to let you know that I am ready to raise the PR for PP-773. Please see the branch https://github.com/CESNET/pbspro/tree/PP-773. There are no big changes in the commit. Of course, I can wait to raise the PR. Please let me know.

Vasek

Alright, now I have:

  1. reverted the title back to “External interface design for subjobs surviving server restarts”.
  2. deleted Interface 6; it can be addressed in a separate ticket in the future.

I have tried to justify the remaining 4 interfaces below:

Interfaces 1 & 4 address the problem description in the Jira (the “smaller” scope, i.e. the survive-restart portion).
The design approach described in points 3 & 4 of Interface 1 will lay the groundwork for tackling the bigger scope, if any, in future RFEs.

Points 3 & 4 of Interface 1
  • This is achieved by putting a running subjob on par with a single job: each subjob’s job object and its attributes are stored in the PBS database’s job table and job_attr table, and are recovered during a subsequent server start (pbsd_init()). A rough sketch of the idea follows this list.
  • The “pbs.subjob_track” table is removed from the PBS database schema.
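To make points 3 & 4 concrete, here is a minimal, hypothetical sketch (Python with an in-memory SQLite database; the table and column names are simplified placeholders, not the actual PBS schema or code) of storing each running subjob as a full job record and recovering it at server start:

```python
# Sketch only (not PBS source code): persist each running subjob as a full job
# record plus attribute rows, so a restarted server can rebuild the subjob
# objects directly from the database instead of a separate subjob_track table.
# Table and column names are simplified placeholders, not the real PBS schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job (ji_jobid TEXT PRIMARY KEY, ji_state TEXT)")
db.execute("CREATE TABLE job_attr (ji_jobid TEXT, attr_name TEXT, attr_value TEXT)")

def save_subjob(jobid, state, attrs):
    """Store a running subjob exactly like a normal job (job + job_attr rows)."""
    db.execute("INSERT OR REPLACE INTO job VALUES (?, ?)", (jobid, state))
    db.execute("DELETE FROM job_attr WHERE ji_jobid = ?", (jobid,))
    db.executemany("INSERT INTO job_attr VALUES (?, ?, ?)",
                   [(jobid, name, value) for name, value in attrs.items()])
    db.commit()

def recover_jobs():
    """Analogue of the recovery pass at server start: rebuild every job,
    including subjobs, from the database tables."""
    jobs = {}
    for jobid, state in db.execute("SELECT ji_jobid, ji_state FROM job"):
        attrs = dict(db.execute(
            "SELECT attr_name, attr_value FROM job_attr WHERE ji_jobid = ?",
            (jobid,)))
        jobs[jobid] = {"state": state, "attrs": attrs}
    return jobs

# A running subjob of array 123[] is saved on par with a single job ...
save_subjob("123[7].server", "R", {"resources_used.cput": "00:02:15"})
# ... and survives a simulated restart; no subjob tracking table is involved.
print(recover_jobs())
```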

Interface 2 is a simple but useful feature, enabled by removing a few lines of restriction code.

Interface 3 is a natural side effect of the problem statement.

acceptable?

@vchlum I agree that your changes are not big.
Please go ahead and raise the PR for your change.

Agreed! I think it would be a good idea to start a new discussion topic whenever a discussion is significantly expanded.

This gives the broader community a chance to see the new topic and join the new conversation that may be of interest. Not everyone follows every post on this forum :slight_smile:, and the way Discourse works (we are using Discourse to manage community.pbspro.org), it’s easy to get notified when a topic is first created and then decide whether or not to follow future posts to that topic. If someone decides not to follow a topic based on its initial scope, they would miss all future discussion unless a new topic is created.

Thanks!


Thanks for all the work on this! I like the additional features (allowing non-rerunnable job arrays and getting individual job-array data for finished subjobs). One suggestion and a couple of questions on these additional features:

Suggestion: If the final check-in includes the ability to have non-rerunnable subjobs, I suggest either adding a new RFE or changing the title of this RFE so that it’s easy to learn that this new feature exists.

Question: Are there any other externally visible ways to see finished subjobs besides qstat -xtf? Will this cause any summary qstat listings to grow (e.g., now showing potentially thousands of new lines, one for each finished subjob)?

Question: What are the performance implications of keeping the individual finished subjobs in memory? Are there sites that are running millions of subjobs per day? How much will this change increase memory footprint? What about server responsiveness?

Again, thanks!

OK, I see. I personally wouldn’t say that each subjob was saved in the database as a job. If all queued subjobs are saved as part of the parent array job, then the subjobs aren’t individually being saved. This was my confusion. Maybe some rewording on this point is in order? Now that I understand what is going on, it sounds technically sound. You are right, saving the tracking table is no longer needed.

Instantiating a queued subjob is more work than just creating the job object and saving it to the database. The scheduler assumes all queued subjobs are part of the parent job array. It internally creates subjobs similarly to how the server does when they are run. If this behavior changes on the server side, it will also need to change on the scheduler side. Plus, right now the queued subjobs are “stored” in the parent array via the array_indices_remaining attribute. Will this change if we instantiate one early?
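For context on that last point, here is an illustrative sketch (Python, not PBS source; the “3,5-9” range format is only an example of a compact indices list such as array_indices_remaining, whose exact format in PBS may differ) of how queued subjobs exist only as a range expression on the parent, and what instantiating one early would imply:

```python
# Illustrative sketch: queued subjobs live only as a compact range expression
# on the parent array job, not as individual job objects.  The "3,5-9" format
# is an example; the real attribute syntax in PBS may differ.

def expand_indices(remaining):
    """Expand a range expression such as "3,5-9" into individual indices."""
    indices = []
    for part in remaining.split(","):
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            indices.extend(range(lo, hi + 1))
        else:
            indices.append(int(part))
    return indices

def instantiate_subjob(remaining, index):
    """What "instantiating a queued subjob early" would imply: remove its index
    from the parent's compact list and create a real job object for it, which
    is the extra server- and scheduler-side work described above."""
    left = [i for i in expand_indices(remaining) if i != index]
    return ",".join(str(i) for i in left)

print(expand_indices("3,5-9"))         # [3, 5, 6, 7, 8, 9]
print(instantiate_subjob("3,5-9", 6))  # 3,5,7,8,9 (re-compaction into ranges not shown)
```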

I believe this is a much larger can of worms than we want to handle right now. I think we should keep the current restriction of not altering individual subjobs.

Since job array behavior is slated to change with PP-506, I am not sure we should talk about the bigger picture now. What we decide now will probably have to change when PP-506 is finalized.

Bhroam

Yes, if we did this, then code that used to look at only the parent array to know about queued jobs would need to change, so yes, it would be some work. I agree we should not change that for this release.

We should, however, have a mechanism to continue the discussion on this. I know PP-506 has discussed impacting job-array behavior, but job arrays are not the core subject of PP-506. I think we need a discussion on where we want to take job arrays going forward. Besides, waiting for another RFE in order to combine the discussions is not the right thing to do, since we tend to forget. Rather, we should start a discussion now, and when we revive PP-506, that RFE should take note of what has been discussed about job arrays so far (and, of course, each discussion can influence the other).

There is a direct impact in this case. The subjobs take space like regular jobs when running/finished, so when a lot of them accumulate, memory usage will increase proportionally and server responsiveness will decrease (when a qstat is issued). The memory usage is as much as would be needed if all the finished subjobs were regular, non-array jobs.
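As a rough back-of-the-envelope illustration of that proportional growth (the per-job size below is a placeholder assumption, not a measured PBS figure):

```python
# Back-of-the-envelope illustration of proportional memory growth when finished
# subjobs are kept in history like regular jobs.
# per_job_kib is a placeholder assumption, not a measured PBS value.
def history_memory_mib(finished_subjobs, per_job_kib=20):
    """Finished subjobs kept in history cost roughly the same as the
    equivalent number of regular non-array jobs."""
    return finished_subjobs * per_job_kib / 1024

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} finished subjobs -> ~{history_memory_mib(n):,.0f} MiB")
```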

However, there is no impact on job throughput performance. There could be an impact on server restart, since we will load real subjobs into memory instead of a tracking table.

If this becomes a performance problem, we could consider options to cull history space more aggressively for subjobs (but that would make them differ more from normal jobs)… so maybe something to consider later?

I agree. I have now deleted Interface 2. I will create a separate RFE and check in its code after the current feature is merged.

There won’t be any major change other than in qstat -xtf. Further, the qstat listings will provide more accurate info for a finished subjob, for example the cputime and comment fields. There is no change to the qstat listing of a subjob in any state apart from the “-f” (“full”) option of qstat.

Your second question has been answered by Subhasis in the previous post above.


I suggest we restrict this for now, since a job array is supposed to be uniform. So if users want to change this, they would need to change the parent, not the subjob.

OK, no impact on job throughput, but there is a potential slowdown (during qstats and restarts) as well as a memory footprint increase.

@jon – what do you think?

I don’t like the idea of impacting server performance with this change, but I know that if we store the data in memory, we will have to recover the data. Culling the subjob data more aggressively if job history is on should help. As for the memory footprint, I would suggest that we clean up what we can once the subjob has completed, especially fields that can get quite large, such as the environment and possibly the node list.

Hi all

Thank you for the healthy discussion. I believe the forum has come to a consensus on the current design.

Now I will focus on dev work and getting the PR ready, hoping that I have sign-off on the design.

If you differ on this, please let me know.

Thanks,
Shrini


@Shrini-h, I believe that we should make individual subjobs rerunnable. Thinking about this more: currently, if a subjob gets sent to a bad node and fails, the user is not able to rerun that subjob. By allowing a subjob to be rerunnable, we would eliminate the need to run the whole array over again. Thoughts?

Jon

@jon As we will be focusing on surviving the restart for this EDD, in case you haven’t noticed, I have removed Interface 2, which described the non-rerunnable feature.

My thoughts:
I believe a non-rerunnable job is “rerunnable” until an event occurs that makes it non-rerunnable. Ideally, we should try to automatically “rerun” a non-rerunnable subjob that returned from a failed node. Of course, we need to evaluate whether the non-rerunnable subjob can still be considered “rerunnable” (i.e. the event that makes it non-rerunnable should not have occurred). Or we could add new interfaces to let the user retrigger a failed non-rerunnable subjob. Hope I’m making sense here.

Anyway, we can take it up in a future RFE.

Thanks,
Shrini

The EDD has been updated with the information below.

The newly added line below has been removed from the EDD:

The capability to qdel a running subjob is extended for finished subjobs in history.

@Shrini-h thanks for updating the EDD. We don’t include log messages in our EDDs any more. Could you remove the new one you just added? You should definitely keep the log message itself, just don’t add it to the EDD.

Bhroam