PP-289: unique job ids up to 1 trillion

Hi All,

Following is the link to the design document for enhancing the unique job IDs to One Trillion:
https://pbspro.atlassian.net/wiki/display/PD/PP-289%3A+unique+job+ids+up+to+1+trillion

Please review the proposed design and provide the comments/feedback for same.

Regards,
Varun

Hi @varunsonkar,

So the purpose of setting the limit is to restrict field width as opposed using the maximum number a 64bit integer could actually hold? Please note that the old maximum could fit within a 32bit integer, while the new one requires a 64bit integer. This is actually a big concern for 32bit systems. Do we need to explore this in the design?

Thanks,

Mike

Hi Varun,
I usually like the idea of not hard coding values into PBS and providing a knob. We usually get ourselves into trouble when just hard coding a value. I’m not sure this is one of those cases. What is the value provided to the admin to turning this knob? Why wouldn’t we just set it to something huge (like maybe the max of an unsigned long)? Is there a case when they’ll want to turn this down? Why not always have unique job ids if we can?

Bhroam

I think @bhroam is right.
Also,
Exposing this attribute leads more in-consistence in job-id … Like “What if the value is set lower than current job ID?”

Hi @bhroam,
Thanks for the suggestion/feedback.
We are now, not allowing to set/unset the jobs’s maximum Id. This would be the same behavior as earlier. Only we are increasing the range to 1 trillion from 10 million.
We believe this will be knob which is high enough for sites. As simple calculation suggests that sites submitting around 10k jobs per sec continuously will take over 3 years to consume 1 trillion job Ids.
We have now edited the design document for same. Please review it.
Regards,
Varun

Hi @mkaro,
Thanks for your reply.
We could not fully understand your point about “This is actually a big concern for 32bit systems” .
Could you please elaborate a little bit more on the possible use case you are referring to.
Regards,
Varun

Are there cases in the PBS Pro source code where the sequence number of a job is stored within an int or a long? If so, any sequence number larger than 2147483647 (2^31 - 1) could cause a problem for 32bit processors or systems using 32bit compilers. We’ll need to make sure that we don’t overflow in these cases. As a reference:
https://en.wikibooks.org/wiki/C_Programming/C_Reference/limits.h

Is this only a problem on a 32bit system or compiler if the system grows to more that 2B+ jobs? Also, do we want to continue to support 32bit systems?

Most of the places inside PBS treat the jobid as a char array, eg the database column is a TEXT field, the ji_jobid member in the job_qs structure is a char array as well, and so are the database queries around it.

CREATE TABLE pbs.job (
ji_jobid TEXT NOT NULL,
ji_sv_name TEXT NOT NULL,

char ji_jobid[PBS_MAXSVRJOBID+1]; /* job identifier */ --> ji_qs

struct pbs_db_job_info {
char ji_jobid[PBS_MAXSVRJOBID + 1]; /* job identifier /
char ji_sv_name[PBS_MAXSERVERNAME + 1]; /
server id */

So the (possibly only) part we have to be careful about is how (or what type of variable we use) to increment the id to get the next one in sequence. One idea was to simply generate it from a sequence in the database which can handle huge values anyway (even on 32 bit systems).

Also as Jon pointed out, we probably don’t need to support 32 bit systems anyway.

In src/incude/server.h the variable sv_jobidnumber is defined as an int. If this is a 32 bit compiler (which is likely for Windows) this value could overflow. We would be safe to use a “long long” or “long long int” to represent this number even for 32 bit hardware and compilers.

Hi @subhasisb,
Thanks for the comment.
I think “PBS_MAXSVRJOBID” and the variables declared using it are not a problem. As you see following is how “PBS_MAXSVRJOBID” has been defined:
#define PBS_MAXSVRJOBID (PBS_MAXSEQNUM - 1 + PBS_MAXSERVERNAME + PBS_MAXPORTNUM + 2) /* server job id size, -1 to keep same length when made SEQ 7 */
Which has worked fine for 32 bit systems.
Having said that there are some places where variables like “sv_jobidnumber” have been declared as “int” also in database as well which can cause issues on 32 bit.
We are analyzing the code with the possible affects and probable solution.
Regards,
Varun

Thanks @mkaro,
Yes we are also moving ahead with our analysis with the same approach (long long).
But we want to be thorough with that we are not breaking somethings like if sizeof(*pointer) is done or comparison is done.
Up till now we have not found anything alarming when we use “long long”.
Once we are done with analysis we will update the same.
Regards,
Varun

Just a quick note: the scheduler queries the job id and shoves it into an int. Look at src/scheduler/data_types.h in the structure job_info

Hi @bhroam,
Thanks for the input. We will analyze and handle it accordingly.

After thinking about this more, I don’t feel that we need to support 32 bit processor. My vote is we move past the 32 bit era to 64 bit. However, if the community feels like we have to support 32 bit processors/compilers then it makes since to set the max to 2147483647 (2^31 - 1). The other thought is that if it significantly reduces risk and development time to just increase the max job id to 2147483647 then maybe we should consider reducing the max job id to the max 32 bit int value.

Changing PBS_MAXSEQNUM from 7 to 12 might affect how output/error files formed as their name structure is “{JOB_NAME}.{JOBID}o” or “{JOB_NAME}.{JOBID}e” . Since system limit for max_file_name_length is 255 we may have to manage the job name length to accommodate this new job id length.

Hi @ashwathraop,
Thanks for the input. We will take care of this as well.

The real issue here is identifying the underlying C data type we use for representing the sequence number. I believe that long long int or long long unsigned is the correct choice because they are the same size (64 bits) regardless of the underlying architecture or the compiler being used.

I agree with Jon that we should leave 32bit architectures behind. It has been quite a long time since chips/OSes became 64bit. While Mike’s point is well taken that we could use a long long, I suggest we use an unsigned long. I think long longs are 128bits these days. That’s a bit overkill. If we use an unsigned long and a long is 32bits, it’ll just cycle back around to 0 earlier than we intended all on its own.

Bhroam

One question and one suggestion:

Question: Should we consider either (a) no upper bound, and/or (b) extending Job IDs to alphanumerics?

Suggestion: If the intention is to have 64 bits, it is best to be explicit:

#include <stdint.h>

int64_t job_id;

And, I believe the above is part of the C99 standard, so it should work most everywhere. (Of course, the C99 standard also mandates that long long int is at least 64 bits, so if it’s just a minimum we want, that would also work.)