PP-479: Running subjobs to be able to survive a pbs_server restart

Hi,

I have created this EDD for implementing PP-479

Please provide your comments / feedback on it here

Thank you
Shrini

Hi Shrini,

Thanks for writing this document.
I have some first level comments…

  • Would this change also mean that one will be able to qalter a subjob after submission? If so, then I’d assume that the subjob information is serialized in the database before it starts running.
  • You have a typo in interface 4, point 1 word “obtained”
  • Can you please elaborate on Interface 5, why a subjob can not be marked as ‘X’ when finished instead of ‘F’.
  • In the Overview text of the document you mention unique information of subjob is now stored. Can you please mark exactly what all is stored for a subjob?

Thanks,
Arun

I guess this is more internal than external, but why are you getting rid of the tracking table and turning all of the queued subjobs into real jobs? When a subjob is run, it is turned into a real job. Why don’t you just save the real running subjobs when the server is shut down, so they can be recovered when the server comes back up?

If you go the route of turning all queued subjobs into real jobs, make sure to not change the pbs_statjob() interface for subjobs. This will break the scheduler.

Bhroam

Thank you Arun for you comments. my replies are as below

No, qalter ing a subjob is not enabled by this change.

Now the status of finished subjob is obtained from status_job() instead of status_subjob() and a status of finished subjob in job obj or db is “X” not “F”. The affecting code snippet is pasted below

---------------------------- src/server/stat_job.c ----------------------------
index 261740f..107d18a 100644
@@ -377,12 +377,11 @@ status_subjob(job *pjob, struct batch_request *preq, svrattrl *pal, int subj, pb
 	if ((pjob->ji_qs.ji_svrflags & JOB_SVFLG_ArrayJob) == 0)
 		return PBSE_IVALREQ;
 
-	/* if subjob is running, use real job structure */
+	/* if subjob job obj exists, use real job structure */
 
-	if (get_subjob_state(pjob, subj) == JOB_STATE_RUNNING) {
-		psubjob = find_job(mk_subjob_id(pjob, subj));
-		if (psubjob)
-			status_job(psubjob, preq, pal, pstathd, bad);
+	psubjob = find_job(mk_subjob_id(pjob, subj));
+	if (psubjob) {
+		status_job(psubjob, preq, pal, pstathd, bad);
 		return 0;
 	}

Well practically almost everything in a subjob is now “unique copy” since the subjobs “pjob” is pushed to job db table including its own set of job attributes which are pushed to job_attr db table. However all the subjobs of an array will have some of information entries same.

Thank you Bhroam for your comments

I agree this is more internal, but I was insisted for putting up an “EDD”

Im not getting rid of the tracking table from the memory cache, however I’m getting rid of the subjob tracking table from the database table.
@subhasisb please back me up here for removing the tracking table from database?

Nope, queued subjobs are not turned to real jobs in this design

Exactly what is being done

As I said, queued subjobs are not turned to real jobs. There is no change in pbs_statjob() and no change in scheduler as well

I’m sorry for appending 6th interface to EDD now.

I have a working experimental code in my dev branch.
If you are interested you can have a look.
please note that its NOT ready for review.

Its at the top commit named “save progress” in my dev branch: https://github.com/Shrini-h/pbspro/tree/PP-479

Hi – the change to allow subjobs to survive server restarts is great – thx!

It sounds like there are additional interface changes also being proposed, e.g., introducing a different finished state for subjobs (X versus F), changing how subjob status is accessed, possibly others. I feel these other changes should be avoided unless there is a compelling use case for them. And, even then, perhaps those interface changes can be split off into a separate ticket/check-in?

Is there a way to only fix the this one issue (allow subjobs to survive server restarts) without making all these other interface changes too?

Thank you @billnitzberg for you feedback.

As I mentioned in the interface, I was trying to disclose one of the side effect from my experimental design implementation as an interface just to “test the water”.

However, I worked on your suggestion and I was able to address it by finding a way to keep the current behavior on finished subjob state. So I have deleted this interface 5 in latest EDD update.

Hope that was your only concern and it has been addressed.

I have a couple of minor suggestions

  1. In interface 1 bullet 3, say “running subjob” instead of just subjob. The current text makes it sound that all subjobs will be created as job objects.
  2. Interface 2: Should removing the restriction on subjobs being always rerunable be part of this RFE or another? It seems tangential to this RFE. This RFE is about making subjobs survive a server restart. Whether or not you can do a qrerun on a subjob sounds somewhat different.
  3. Could you explain to me a little more about interface 6? If queued subjobs aren’t real jobs, then how can the run_* attributes be incremented if a job goes from R->Q. Will the subjob not be reabsorbed into the job array? I know that one of the run_* (I think run_version) attributes is part of the job array tracking table, but you are removing that from the database. Won’t that not persist across server restarts?

Bhroam

@bhroam, my feel is that the title of the enhancement is probably slightly limiting. I think the original idea was to make subjobs be treated like regular (non-array) jobs. In that vein, the point was not only to be able to have subjobs survive server restarts, but also start to show behavior similar to regular jobs, and thus we felt like removing the additional restrictions placed on sub-jobs makes sense.

Maybe we should broaden the topic of the enhancement to really say what we wanted “make subjobs first class citizens - aka have all the rights of a regular job”. Of course, if we want to limit changes, we can always implement part of the enhancement, but even discussing the entire enhancement could be beneficial. What do you feel?

About the tracking table: This was only required to be stored since we did not store the subjobs in the database. Now since we have the subjobs stored like regular jobs, we can always re-create the data we need - this, i feel, is better than having two different representation of the same data, which can potentially cause inconsistencies if we are not careful. At startup we would recreate the needed count information from the regular job table (instead of the tracking table), and thus we will not have to do 2 saves as well, just the regular job save would simply function well enough. Thoughts?

It might be worth coordinating PP-773 and PP-479, as both affect subjobs of array jobs.

This makes sense. Maybe changing the title of the RFE would be good.

So let me see if I get this straight. While the server is running, a job array will consist of a tracking table which contains all the queued subjobs and one job object per running subjob. When we save the job data into the database, we take the tracking table, “create” a job object for each subjob, and save that into the database? A 10000 way job array will create 10000 job entries in the database?

While I don’t know the performance impact on this, I’m not sure I see the point. If the queued subjobs are going to be in a tracking table in memory, why don’t we save that tracking table to the database? The tracking table only contains a small amount of data for each queued subjob, so why are we copying the vast majority of the data from the parent just to save it? We’ll have a lot of duplicate data.

If the point of this RFE is to make a subjob a first class citizen, I’d think we’d want to make the queued subjobs real jobs in memory. That way you can make any modifications to a particular subjob (e.g., change it mail points) if you wanted to.

I’m not the database expert, so maybe it isn’t a big deal to have all that duplicated job data in the database. If it isn’t, my worries are unfounded.

Bhroam

No, that is not the plan. Subjobs will be stored (saved) into the DB just like regular jobs, at the point of time of their creation in the server, and then on for usual state changes etc. So, the number of saves is exactly the same as before, i.e, the number of jobs that can pass through PBS (run throughput). The other reason to save every subjob like a regular job is that to be able to continue after a server restart, we need to store the exact job attributes like exec_xxx attributes…

However, the tracking table does not need to be saved. It can be just held in memory to server only one purpose, ie, to quickly respond to stat’s that ask for array counts…we would obviously not want to count the number of running, waiting, subjobs of a large array every time a stat is issued. So that in essence is just a cache, which will be created at server startup (from the DB will have subjobs in the regular job table) and the in-memory tracking table counts will be updated.

Does this sounds reasonable?

Yes, we should coordinate. Thanks for pointing out.

Done.

Since I now edited the EDD title to relflect the whole picture, I believe you wont have this concern anymore. Moreover, If new usable feature is got by removing three lines of code, then we should definitely go for it.

I checked the code and run_version is not part of the tracking table (both db and cache).
I believe once a a subjob hits R state it will have its job obj and job db. It will continue to have these even if it goes back to Q. i.e subjob wont get reabsorbed into the job array.

Now the title is changed as below

“External interface design for making subjobs on par with regular jobs and subjobs surviving server restarts.”

I completely agree that running subjobs need to be saved as a normal job. I still am confused about queued subjobs. In memory they exist in a tracking table. This means that only some attributes of a queued subjob can be modified. If this is the case, why are you writing more data than that out to disk? I understand that if you write the whole job structure out to disk you can recreate the tracking table from it. You’ll just go through an exercise of writing out extraneous data, and then when you read it back in, you’ll throw that data away. To me, that sounds wasteful. Either create all the subjobs as real jobs in memory, or just write the tracking table out to disk. Doing it half way seems wrong.

As a note, with the new title, it makes it sound that I can modify an attribute on a queued subjob. Can I? Can I do a qalter -a 2200 123[34]?

I also hesitate to do changes like this now because of what we’re talking about for PP-506. We’re changing the way job arrays exist in PBS for that RFE. Should we be making changes like this now?

Maybe we should revert the title back to what it was and truly just do that. Just make running subjobs survive a server restart. It’s a small bite sized piece of work.

Bhroam

I love the idea of making “subjobs on par with regular jobs”, but am concerned that it is a big effort and introduces a lot of risk for this late in the release cycle. Why not restrict this fix to the original enhancement (only addressing subjobs surviving server restarts), then make additional (more risky) changes in behavior for the next release?

Some things to consider if subjobs == regular jobs:

  • qalter – regular jobs can be altered when queued
  • peer scheduling – regular jobs can be peered
  • eligible_time – regular jobs each accrue eligible_time individually

I’m sure there are a lot of other things to consider too…

Not half way Bhroam. I believe we do not need to save either the queued jobs, or the tracking table in the database. We anyway save the array itself in the DB, and as jobs are instantiated (run and finished or requeued), they are saved like regular jobs. So what you have in the db is something like this:

123.svr[1-10] → Array parent itself
123.svr[5] → Running
123.svr[6] —> (instantiated in other ways, like ran and now requeued).
123.svr[9] → Finished

This above information is enough to recreate the tracking table in memory and populate the counts of running, queued, finished etc jobs.

Technically, yes, we should be able to qalter a queued subjob. Even an uninstantiated one, which can be “instantiated” in the db when there is a need. If we do not want to do this, we can restrict this easily as well.

I do see the need to discuss the intended behavior, which is why we wanted to discuss with the community. We wanted to discuss the entire larger view and then only implement the part we are clear on, in terms of the behavior.