PP-479: Running subjobs to be able to survive a pbs_server restart


I have created this EDD for implementing PP-479.

Please provide your comments and feedback on it here.

Thank you.

Hi Shrini,

Thanks for writing this document.
I have some first-level comments:

  • Would this change also mean that one will be able to qalter a subjob after submission? If so, then I’d assume that the subjob information is serialized in the database before it starts running.
  • You have a typo in interface 4, point 1: the word “obtained”.
  • Can you please elaborate on interface 5: why can a subjob not be marked as ‘X’ when finished, instead of ‘F’?
  • In the Overview text of the document you mention that unique information for each subjob is now stored. Can you please state exactly what is stored for a subjob?


I guess this is more internal than external, but why are you getting rid of the tracking table and turning all of the queued subjobs into real jobs? When a subjob is run, it is turned into a real job. Why don’t you just save the real running subjobs when the server is shut down, so they can be recovered when the server comes back up?

If you go the route of turning all queued subjobs into real jobs, make sure to not change the pbs_statjob() interface for subjobs. This will break the scheduler.


Thank you, Arun, for your comments. My replies are below.

No, qaltering a subjob is not enabled by this change.

Now the status of a finished subjob is obtained from status_job() instead of status_subjob(), and the state of a finished subjob in the job object or DB is “X”, not “F”. The affected code snippet is pasted below:

---------------------------- src/server/stat_job.c ----------------------------
index 261740f..107d18a 100644
@@ -377,12 +377,11 @@ status_subjob(job *pjob, struct batch_request *preq, svrattrl *pal, int subj, pb
 	if ((pjob->ji_qs.ji_svrflags & JOB_SVFLG_ArrayJob) == 0)
 		return PBSE_IVALREQ;
-	/* if subjob is running, use real job structure */
+	/* if subjob job obj exists, use real job structure */
-	if (get_subjob_state(pjob, subj) == JOB_STATE_RUNNING) {
-		psubjob = find_job(mk_subjob_id(pjob, subj));
-		if (psubjob)
-			status_job(psubjob, preq, pal, pstathd, bad);
+	psubjob = find_job(mk_subjob_id(pjob, subj));
+	if (psubjob) {
+		status_job(psubjob, preq, pal, pstathd, bad);
 		return 0;
 	}

Well, practically everything in a subjob is now a “unique copy”, since the subjob’s “pjob” structure is pushed to the job DB table, along with its own set of job attributes, which are pushed to the job_attr DB table. However, all the subjobs of an array will still share some common entries.

Thank you, Bhroam, for your comments.

I agree this is more internal, but I was asked to put up an EDD.

I’m not getting rid of the tracking table in the in-memory cache; however, I am getting rid of the subjob tracking table in the database.
@subhasisb, can you please back me up here on removing the tracking table from the database?

Nope, queued subjobs are not turned into real jobs in this design.

That is exactly what is being done.

As I said, queued subjobs are not turned into real jobs. There is no change in pbs_statjob(), and no change in the scheduler either.

I’m sorry for appending a 6th interface to the EDD now.

I have working experimental code in my dev branch.
If you are interested, you can have a look.
Please note that it is NOT ready for review.

It’s at the top commit, named “save progress”, in my dev branch: https://github.com/Shrini-h/pbspro/tree/PP-479

Hi – the change to allow subjobs to survive server restarts is great – thx!

It sounds like there are additional interface changes also being proposed, e.g., introducing a different finished state for subjobs (X versus F), changing how subjob status is accessed, possibly others. I feel these other changes should be avoided unless there is a compelling use case for them. And, even then, perhaps those interface changes can be split off into a separate ticket/check-in?

Is there a way to fix only this one issue (allow subjobs to survive server restarts) without making all these other interface changes too?

Thank you, @billnitzberg, for your feedback.

As I mentioned in the interface, I was trying to disclose one of the side effects of my experimental implementation as an interface, just to “test the waters”.

However, I worked on your suggestion and was able to address it by finding a way to keep the current behavior for the finished-subjob state. So I have deleted interface 5 in the latest EDD update.

Hope that was your only concern and it has been addressed.

I have a couple of minor suggestions

  1. In interface 1, bullet 3, say “running subjob” instead of just “subjob”. The current text makes it sound as though all subjobs will be created as job objects.
  2. Interface 2: Should removing the restriction that subjobs are always rerunnable be part of this RFE or another? It seems tangential to this RFE. This RFE is about making subjobs survive a server restart; whether or not you can do a qrerun on a subjob sounds somewhat different.
  3. Could you explain a little more about interface 6? If queued subjobs aren’t real jobs, then how can the run_* attributes be incremented when a job goes from R->Q? Will the subjob not be reabsorbed into the job array? I know that one of the run_* attributes (I think run_version) is part of the job array tracking table, but you are removing that from the database. Won’t it then fail to persist across server restarts?


@bhroam, my feeling is that the title of the enhancement is probably slightly limiting. I think the original idea was to make subjobs be treated like regular (non-array) jobs. In that vein, the point was not only to have subjobs survive server restarts, but also to have them start showing behavior similar to regular jobs, and thus we felt that removing the additional restrictions placed on subjobs makes sense.

Maybe we should broaden the topic of the enhancement to really say what we wanted: “make subjobs first-class citizens, aka have all the rights of a regular job”. Of course, if we want to limit changes, we can always implement part of the enhancement, but even discussing the entire enhancement could be beneficial. What do you think?

About the tracking table: this was only required to be stored because we did not store the subjobs in the database. Now that we have the subjobs stored like regular jobs, we can always re-create the data we need. This, I feel, is better than having two different representations of the same data, which can potentially cause inconsistencies if we are not careful. At startup we would recreate the needed count information from the regular job table (instead of the tracking table), and thus we will not have to do two saves either; the regular job save alone would function well enough. Thoughts?

It might be worth coordinating PP-773 and PP-479, as both affect subjobs of array jobs.

This makes sense. Maybe changing the title of the RFE would be good.

So let me see if I get this straight. While the server is running, a job array will consist of a tracking table, which contains all the queued subjobs, and one job object per running subjob. When we save the job data into the database, we take the tracking table, “create” a job object for each subjob, and save that into the database? A 10000-way job array will create 10000 job entries in the database?

While I don’t know the performance impact of this, I’m not sure I see the point. If the queued subjobs are going to be in a tracking table in memory, why don’t we save that tracking table to the database? The tracking table only contains a small amount of data for each queued subjob, so why are we copying the vast majority of the data from the parent just to save it? We’ll have a lot of duplicate data.

If the point of this RFE is to make a subjob a first-class citizen, I’d think we’d want to make the queued subjobs real jobs in memory. That way you can make any modifications to a particular subjob (e.g., change its mail points) if you wanted to.

I’m not the database expert, so maybe it isn’t a big deal to have all that duplicated job data in the database. If it isn’t, my worries are unfounded.


No, that is not the plan. Subjobs will be stored (saved) into the DB just like regular jobs, at the point of their creation in the server, and from then on for the usual state changes, etc. So the number of saves is exactly the same as before, i.e., the number of jobs that pass through PBS (run throughput). The other reason to save every subjob like a regular job is that, to be able to continue after a server restart, we need to store the exact job attributes, like the exec_xxx attributes…

However, the tracking table does not need to be saved. It can be held just in memory, to serve only one purpose, i.e., to quickly respond to stats that ask for array counts. We would obviously not want to count the number of running, waiting, etc. subjobs of a large array every time a stat is issued. So it is, in essence, just a cache, which will be created at server startup (the DB will have the subjobs in the regular job table) and the in-memory tracking-table counts will be kept updated.

Does this sound reasonable?

Yes, we should coordinate. Thanks for pointing that out.


Since I have now edited the EDD title to reflect the whole picture, I believe you won’t have this concern anymore. Moreover, if a new usable feature can be gained by removing three lines of code, then we should definitely go for it.

I checked the code, and run_version is not part of the tracking table (either DB or cache).
I believe that once a subjob hits the R state, it will have its own job object and job DB entry. It will continue to have these even if it goes back to Q; i.e., the subjob won’t get reabsorbed into the job array.

The title is now changed as below:

“External interface design for making subjobs on par with regular jobs and subjobs surviving server restarts.”

I completely agree that running subjobs need to be saved as normal jobs. I am still confused about queued subjobs. In memory they exist in a tracking table. This means that only some attributes of a queued subjob can be modified. If this is the case, why are you writing more data than that out to disk? I understand that if you write the whole job structure out to disk, you can recreate the tracking table from it. You’ll just go through an exercise of writing out extraneous data, and then, when you read it back in, you’ll throw that data away. To me, that sounds wasteful. Either create all the subjobs as real jobs in memory, or just write the tracking table out to disk. Doing it halfway seems wrong.

As a note, with the new title it sounds as though I can modify an attribute on a queued subjob. Can I? Can I do a qalter -a 2200 123[34]?

I also hesitate to do changes like this now because of what we’re talking about for PP-506. We’re changing the way job arrays exist in PBS for that RFE. Should we be making changes like this now?

Maybe we should revert the title back to what it was and truly just do that: make running subjobs survive a server restart. It’s a small, bite-sized piece of work.


I love the idea of making “subjobs on par with regular jobs”, but am concerned that it is a big effort and introduces a lot of risk for this late in the release cycle. Why not restrict this fix to the original enhancement (only addressing subjobs surviving server restarts), then make additional (more risky) changes in behavior for the next release?

Some things to consider if subjobs == regular jobs:

  • qalter – regular jobs can be altered when queued
  • peer scheduling – regular jobs can be peered
  • eligible_time – regular jobs each accrue eligible_time individually

I’m sure there are a lot of other things to consider too…

Not halfway, Bhroam. I believe we do not need to save either the queued jobs or the tracking table in the database. We save the array itself in the DB anyway, and as subjobs are instantiated (run and finished, or requeued), they are saved like regular jobs. So what you have in the DB is something like this:

123.svr[1-10] --> Array parent itself
123.svr[5] --> Running
123.svr[6] --> Requeued (instantiated in other ways, like ran and now requeued)
123.svr[9] --> Finished

The above information is enough to recreate the tracking table in memory and populate the counts of running, queued, finished, etc. jobs.

Technically, yes, we should be able to qalter a queued subjob, even an uninstantiated one, which can be “instantiated” in the DB when there is a need. If we do not want to allow this, we can easily restrict it as well.

I do see the need to discuss the intended behavior, which is why we wanted to discuss it with the community. We wanted to discuss the entire larger view and then implement only the part whose behavior we are clear on.