PP-289: unique job ids up to 1 trillion

Hello all,
As per the latest comments, I have added a server attribute “max_job_sequence_id” to the EDD; it was previously in v1 of the EDD.
Since the discussion about keeping job ids sequential or non-sequential is still ongoing, for the time being I will implement the max_job_sequence_id changes on the assumption that users need incrementing numbers only.
Please have a look at the EDD and provide your valuable comments/sign-off.

Thanks @Bhagat this looks reasonable to me. The question of the sequentially incrementing requirement can be revisited later, if required.

Hi – thanks @Bhagat for working on this. Looking at the updated v.10 design, I have a couple questions:

  1. What is the type of max_job_sequence_id, and what are the legal values? By the name, it sounds like it is an integer (or string). If that is true, I suggest restricting allowed values to exactly the two numbers discussed, 999999999999 and 9999999 (and otherwise returning an error). Alternatively, picking a name that doesn’t sound like an arbitrary integer would be OK too, e.g., “big_job_ids = true”.

  2. Can an admin reduce the max sequence number (e.g., if it was previously changed from 9999999 to 999999999999, can they change it back to 9999999)? And, if yes, what happens in the case where there are jobs with IDs greater than the current max sequence ID queued or in history?

Thx!

Looks good to me so far as well. The possible additions to consider at the moment are:

  1. explicitly mention what happens when max_job_sequence_id is set to something smaller than the actual current next sequence id?

  2. Interface 2 is not specific about which qstat format(s) are affected. The example given is traditional “qstat”, but what about “qstat -a” (which also gets used when things like -n, -s, etc. are used)? Do we just assume that people can use wide output format with -a (etc.) if they need to and that this is adequate?

  3. What about pbs_rstat output format? Reservation identifiers (and their queue names) can now also be longer but the field widths are narrow.

This seems to be similar to what ScottC was proposing as well. My feeling was that this restriction is not required. Without it, sites can configure any range of their choice, so that the job id is large enough to cover the time it takes to wrap but no larger than they need. In other words, if they needed to add just one character, should we not allow that?

Hi @scc, if max_job_sequence_id is set to something smaller than the current max_sequence_id, it should only affect the next incoming job’s id. In the current code, if the next_sequence_id is greater than the limit, it wraps to 0 and tries from there. As such, I feel we do not need to check whether jobs with larger ids are already circulating in the system, since the expanded qstat output width will accommodate them anyway. And yes, we should mention this in the EDD.
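
To make the wrap behavior described above concrete, here is a minimal sketch. It is not the PBS server code: the function name, the `in_use` set, and the "skip ids that are still in use" step are assumptions made for illustration of "wrap to 0 and try from there".

```python
def allocate_job_sequence_id(next_id, max_job_sequence_id, in_use):
    """Pick the sequence id for the next incoming job (hypothetical helper).

    next_id             -- the id the server would try next
    max_job_sequence_id -- the configured limit
    in_use              -- ids still held by jobs in the system (assumption)
    """
    candidate = next_id
    # At most max_job_sequence_id + 1 distinct ids exist, so try each once.
    for _ in range(max_job_sequence_id + 1):
        if candidate > max_job_sequence_id:
            candidate = 0  # past the limit: wrap to 0 and try from there
        if candidate not in in_use:
            return candidate
        candidate += 1
    raise RuntimeError("no free job sequence id below the limit")
```

With the limit lowered to 9999999 while the next id is already 10000000, the next job simply gets id 0 (or the first free id after it); jobs that already hold larger ids are not touched.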

Hi @subhasisb and @Bhagat – thanks for the info and updates – the max_job_sequence_id seems good.

However, I am concerned about backward compatibility, especially with qstat (format), and to a lesser extent with job name (max size shrinking). Many sites have tools that parse the output of qstat, and those tools depend on the exact output and output format of qstat. Changing this and not providing a backward compatible option is bad.

It’s not completely clear from the design whether the default setting “max_job_sequence_id = 9999999” would retain the existing qstat output format sizes and job name maximum. If that is the intention, then there is no backward compatibility issue. (Of course, this creates the issue of how to handle qstat output format size (and job name size) if one increases the max_job_sequence_id, then shrinks it and there are jobs with “large” sequence IDs remaining in the queues/history).

If the proposal is to break qstat backward compatibility (column sizes and positions), then I feel we need to be very careful and actively get more input before proceeding. This is one place in PBS Pro where there really is a lot of dependent tooling, and often neither the awareness nor the skills to address it at many sites.

Thanks @scc for your valuable feedback.
I have mentioned in the EDD that “if max_job_sequence_id is set to something smaller than the actual current next sequence id then it will wrap to zero (0)”.
I have updated Interface 2 to mention the other qstat options as well (qstat -s/t/a/st etc.).
I have added Interface 3 for pbs_rstat.
Please review the EDD and provide your feedback.

Thanks for the feedback @billnitzberg

Yes, we totally recognise the issue with qstat’s backward compatibility, and thus we want to discuss carefully before we proceed with updating the design.

There seem to be two possible ways to go about the qstat output (primarily the modes of qstat that produce tabular output; options like -f, which print a full format with one attribute per line, are not affected, right?).

For tabular qstats (qstat, qstat -s etc.), the options seem to be the following:

  1. Increase the field width (hardcoded) to accommodate the maximum possible width. So no matter whether a site changed the job id width or not, the qstat output is wider than today. This is the least backward compatible (and probably the most dangerous?).

  2. Do NOT increase the qstat width. Instead, truncate the job id to fit the current job id display width.
    a) This is already done in case of large server names, where the jobid.servername would have exceeded 15 characters. The numeric part of the jobid is still always visible. (existing/current behavior)

    b) If we truncate to fit the same 15-character width (in the qstat display), we will still accommodate the FULL numeric part of the max expanded jobid (which can be at most 12 characters at the highest setting of max_job_sequence_id). This seems to be the most backward compatible, since we will still get the full numeric part of the jobid, the “.” and a couple of characters from the server name. Also, this truncation will only matter for sites which actually end up using very, very large jobids in practice. For default users, no change.

  3. Have a dynamic qstat tabular output width. This is harder to implement since qstat has no idea of the max_job_sequence_id (today) - of course it can query it from the server, but that is more code. The effect would be that the sites who use the default are not affected, and anybody who changes the width gets an equivalent change in the qstat display width.

As of now we are preferring option (2) since we can change to option (3) anytime in the future. This is also assuming that the change in length of the jobid in qstat -f output does not affect parsers.
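
As a rough illustration of option (2), the sketch below truncates a displayed job id to the 15-character column mentioned above. The helper name and the server name `pbsserver01` are made up for the example; this is not how qstat is actually implemented.

```python
JOB_ID_COLUMN_WIDTH = 15  # job id display width cited in this thread


def display_job_id(job_id, width=JOB_ID_COLUMN_WIDTH):
    """Cut "<sequence>.<server>" down to the column width (hypothetical helper)."""
    return job_id[:width]


# A 12-digit sequence number, the largest under discussion, still fits in full:
# 12 digits + "." + the first 2 characters of the server name = 15 characters.
print(display_job_id("999999999999.pbsserver01"))
# A default-sized id loses only part of the server name, as it does today.
print(display_job_id("1234567.pbsserver01"))
```

The point of the example is only that truncation at 15 characters never cuts into the numeric part, even at the highest setting.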

Thoughts highly appreciated.

Thanks @subhasisb for the detailed explanation.
I have added an example to the EDD showing how qstat will display the jobid after the change.
I have also added that max_job_sequence_id cannot be set lower than the default (9999999).
Please provide your feedback.
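
For reviewers, a minimal sketch of the range check implied by the rule above, assuming the upper bound is the 12-digit value 999999999999 discussed earlier in this thread. The function and constant names are hypothetical, not taken from the PBS source or the EDD.

```python
DEFAULT_MAX_JOB_SEQUENCE_ID = 9_999_999        # 7 digits, the default
LARGEST_MAX_JOB_SEQUENCE_ID = 999_999_999_999  # 12 digits (assumed upper bound)


def validate_max_job_sequence_id(value):
    """Return value if it is an acceptable setting, else raise ValueError."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("max_job_sequence_id must be an integer")
    if value < DEFAULT_MAX_JOB_SEQUENCE_ID:
        raise ValueError("max_job_sequence_id cannot be lower than the default")
    if value > LARGEST_MAX_JOB_SEQUENCE_ID:
        raise ValueError("max_job_sequence_id exceeds the largest supported value")
    return value
```

Any value between the two bounds is accepted here, which matches the idea that a site may add just one extra digit if that is all it needs.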

Hello @scc,
Yes, we need to change the pbs_rstat format because the “Resv ID” and “Queue” widths are too narrow.
So, can we change the format? It is important for supporting job ids up to a trillion.
Please give me your opinion.

@Bhagat, it looks like the first part of interface 2 now does not match the example given.

As for pbs_rstat, this is more concerning than the option (2 above) to not change qstat output format width since the entire numeric portion of the reservation ID cannot be displayed (whereas with qstat it fits even without expanding). I’d think any scripts/tools that consume pbs_rstat output would rather have the formatting be different than unknowingly consume truncated information (in a way worse than just part of the .servername tag). I don’t think we have the option to “do nothing” here.

Given this, I think the absolute LEAST we could do is do nothing for qstat and add a wide output format option for pbs_rstat, but I’d rather see us do option 3 for both qstat and pbs_rstat: set the field width wider based on the max_job_sequence_id setting.

Okay @scc, we can proceed with option 3. If others have concerns with option 3, please do chime in now.

Thanks @subhasisb and @scc for the feedback.
On the basis of the latest feedback, I have changed the EDD. If someone sets max_job_sequence_id greater than the default value (9999999), the qstat header width will change; otherwise the qstat header remains the same as it is now. This does not apply to the qstat wide formats. The same condition applies to the pbs_rstat header.
Please review it.
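
The rule above can be sketched roughly as follows. The 15-character default comes from earlier in this thread; how much the column grows is an assumption here (one character per extra digit), since the actual widths are defined in the EDD, and the helper names are hypothetical.

```python
DEFAULT_MAX_JOB_SEQUENCE_ID = 9_999_999
DEFAULT_JOB_ID_WIDTH = 15  # current job id column width cited in this thread


def job_id_column_width(max_job_sequence_id):
    """Column width for the job id field (illustrative rule, not the EDD's)."""
    default_digits = len(str(DEFAULT_MAX_JOB_SEQUENCE_ID))
    extra_digits = len(str(max_job_sequence_id)) - default_digits
    # Never narrower than today; wider only when the limit has more digits.
    return DEFAULT_JOB_ID_WIDTH + max(extra_digits, 0)


def format_row(job_id, name, max_job_sequence_id):
    """Left-justify the (possibly truncated) job id in its column."""
    width = job_id_column_width(max_job_sequence_id)
    return f"{job_id[:width]:<{width}} {name}"
```

With the default setting the width stays at 15, so output is unchanged; with the limit raised to 999999999999 this sketch gives a 20-character column.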

I like it! Just to double check I understand:

  1. If no change is made to max_job_sequence_id, then there are zero changes to existing behavior (same field widths as with v18)

  2. If max_job_sequence_id is raised, then the field widths for qstat and pbs_rstat are increased. If a site needs backward compatibility, they should not do this :slight_smile:.

What happens to the field widths if max_job_sequence_id is first raised, then reset back to the default? Do the field widths go back to the v18 sizes? Do they only go back to the v18 sizes if there are no “big” sequence ids? If they don’t go back to the v18 sizes, what happens if the system reboots or during an overlay upgrade (how is the setting retained)?

Thanks @billnitzberg for reviewing it.
Both of your points are right.
Q: What happens to the field widths if max_job_sequence_id is first raised, then reset back to the default?
A: If max_job_sequence_id is raised first and then reset to the default (i.e. 9999999), only then will the field widths of the qstat and pbs_rstat formats revert to their default behavior.
Q: Do they only go back to the v18 sizes if there are no “big” sequence ids?
A: What do you mean by “big” sequence ids? Could you please elaborate?
We are going to take care of overlay upgrades and system reboots. Please let me know if you need more info on this.

Great, thanks!

Sorry, I meant: what happens when I first change max_job_sequence_id to larger than 9999999, then later change it back to 9999999, and there are objects in the system with job sequence IDs larger than 9999999 (e.g., queued jobs, history jobs, reservations)?

Thx again!

Hello @billnitzberg,
As per the current implementation, if someone first sets max_job_sequence_id greater than the default value (9999999) and later sets it back to the default, the width also reverts to the default for all the qstat and pbs_rstat options. Existing objects (e.g., queued jobs, history jobs, reservations) follow the default format as well, whether their ids are greater or smaller than the default value.
Is this behavior acceptable?
Please provide your feedback. Thanks

Thanks @Bhagat. I think this will be fine, and it is simpler to understand, so is less likely to cause confusion (from that perspective).

I was thinking about the case where an admin increased the max ID, then realized it was causing problems with their tooling so they reset it back to the default, but only after a few jobs in the system ended up with IDs > 9999999. In this case, those IDs might “overflow” the field widths. It’s a corner case, somewhat unlikely, and after thinking more about it, I like the direction you propose: because it is simpler to understand, I think that will outweigh the potential for “overflow” in this narrow case (where an admin made a mistake and corrected it).

Thanks!

Looks good to me, thanks for the changes!