PP-289: unique job ids up to 1 trillion

I like the idea of using the stdint types. Since we don’t have negative sequence numbers, we could use uint64_t to take the maximum from 9,223,372,036,854,775,807 (INT64_MAX) to 18,446,744,073,709,551,615 (UINT64_MAX).

The upper bound really controls how many job IDs can be active (queued, running, held, exiting, etc.) at any given time. This will always be something less than infinity, so I think having an upper bound is fine for our purposes.

Any decimal number beyond about ten digits starts getting a little intimidating for us puny humans. Even with hexadecimal we’re looking at a string of 16 hex digits to represent UINT64_MAX. That’s a little better than 20, but not much. Even with hexatrigesimal (base 36) it’s still 13 characters, with the possibility of some vulgar words embedded.

Hi All,
Thanks for the input. Based on the recent replies, we can consider the following approaches:

Approach 1: Use long long with the upper bound.
Approach 2: Use int64_t. The issue here might be that Microsoft Visual Studio is not fully C99 compliant, so we are not sure whether int64_t is supported.
Approach 3: Use long long or unsigned long long with no upper bound. Let it go to the max value and wrap itself.
Approach 4: Use long or unsigned long with no upper bound, assuming we ignore 32-bit. On both architectures the value wraps around on reaching its maximum, though the maximum values differ.
Approach 5: Use long or unsigned long with an upper bound, again assuming we ignore 32-bit. On 64-bit, wraparound happens at the upper bound, while on 32-bit it happens earlier, on reaching the maximum value.

As of now we are working on “Approach 1”.
Please vote on the approach we should go ahead with, or let me know if I have missed something.

FWIW, I wrote a small program and ran it on a VM to see just how long it would take to increment a uint64_t and measure the amount of time it took to consume each bit in the variable. Obviously, performance will vary, but it gives an idea of how large the domain should be. I gave up after 48 bits…

$ gcc -Ofast -o counter counter.c
$ ./counter
2 (1 bits) (0 seconds)
4 (2 bits) (0 seconds)
8 (3 bits) (0 seconds)
16 (4 bits) (0 seconds)
32 (5 bits) (0 seconds)
64 (6 bits) (0 seconds)
128 (7 bits) (0 seconds)
256 (8 bits) (0 seconds)
512 (9 bits) (0 seconds)
1024 (10 bits) (0 seconds)
2048 (11 bits) (0 seconds)
4096 (12 bits) (0 seconds)
8192 (13 bits) (0 seconds)
16384 (14 bits) (0 seconds)
32768 (15 bits) (0 seconds)
65536 (16 bits) (0 seconds)
131072 (17 bits) (0 seconds)
262144 (18 bits) (0 seconds)
524288 (19 bits) (0 seconds)
1048576 (20 bits) (0 seconds)
2097152 (21 bits) (0 seconds)
4194304 (22 bits) (0 seconds)
8388608 (23 bits) (0 seconds)
16777216 (24 bits) (0 seconds)
33554432 (25 bits) (0 seconds)
67108864 (26 bits) (0 seconds)
134217728 (27 bits) (0 seconds)
268435456 (28 bits) (0 seconds)
536870912 (29 bits) (0 seconds)
1073741824 (30 bits) (0 seconds)
2147483648 (31 bits) (1 seconds)
4294967296 (32 bits) (2 seconds)
8589934592 (33 bits) (5 seconds)
17179869184 (34 bits) (10 seconds)
34359738368 (35 bits) (19 seconds)
68719476736 (36 bits) (39 seconds)
137438953472 (37 bits) (77 seconds)
274877906944 (38 bits) (155 seconds)
549755813888 (39 bits) (310 seconds)
1099511627776 (40 bits) (619 seconds)
2199023255552 (41 bits) (1239 seconds)
4398046511104 (42 bits) (2485 seconds)
8796093022208 (43 bits) (4963 seconds)
17592186044416 (44 bits) (9914 seconds)
35184372088832 (45 bits) (19812 seconds)
70368744177664 (46 bits) (39610 seconds)
140737488355328 (47 bits) (79361 seconds)
281474976710656 (48 bits) (158542 seconds)
^C
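The counter.c source was not posted, so the following is only a sketch of what such a program might look like. The function name count_to_bits and its max_bits parameter are hypothetical, and the volatile qualifier is an assumption added so that an optimizing compiler cannot collapse the loop. Timings from it will not match the figures above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Count upward from zero, printing a line each time the counter reaches the
 * next power of two, and stop once 2^max_bits has been reported
 * (max_bits <= 63). Returns the final counter value. */
uint64_t count_to_bits(unsigned max_bits)
{
    volatile uint64_t counter = 0;  /* volatile: keep the loop from being optimized away */
    uint64_t next = 2;              /* next power of two to report */
    unsigned bits = 1;              /* exponent of that power of two */
    time_t start = time(NULL);

    while (bits <= max_bits) {
        counter++;
        if (counter == next) {
            /* elapsed time is cumulative since the start of the run */
            printf("%" PRIu64 " (%u bits) (%ld seconds)\n",
                   next, bits, (long)(time(NULL) - start));
            next <<= 1;
            bits++;
        }
    }
    return counter;
}
```

Calling count_to_bits(48) from main() would reproduce the shape of the run above. Each additional bit roughly doubles the elapsed time, which is why the full 64-bit range is not practical to exhaust this way.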

@varunsonkar - Sorry, I only looked at your reply now. Yes, internal variables like sv_jobidnumber exist today since we increment the jobid inside the PBS server. However, if we loaded the next id from a database sequence we could easily read that large number as a string… so internally we would never deal with the ID as a number at all. If and when we move towards a multiple server approach, we would need to get the job id sequence etc. from the database anyway instead of incrementing inside the server.

Anyway, for the time being, loading sv_jobidnumber from the database into an int64_t would be fine. The postgres field corresponding to sv_jobidnumber is defined as integer, which only covers this range:

integer | 4 bytes | typical choice for integer | -2147483648 to +2147483647

https://www.postgresql.org/docs/9.5/static/datatype-numeric.html#DATATYPE-INT

So, along with changing the variable in C we will need to change the database column type to bigint. And this would also need an alter statement in case of upgrades.
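On the C side, a minimal sketch of reading a sequence number that arrives as text (for example, a value fetched from the database) into an int64_t might look like this. The function name parse_seq_id is hypothetical and is not part of PBS. It only illustrates that values above the 32-bit limit of 2147483647 need a 64-bit variable and a range check.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a non-negative decimal sequence number into an int64_t.
 * Returns 0 on success, -1 on malformed, negative, or out-of-range input. */
int parse_seq_id(const char *text, int64_t *out)
{
    char *end = NULL;
    long long value;

    if (text == NULL || *text == '\0')
        return -1;

    errno = 0;
    value = strtoll(text, &end, 10);

    /* reject overflow, trailing garbage, and negative numbers */
    if (errno == ERANGE || *end != '\0' || value < 0 || value > INT64_MAX)
        return -1;

    *out = (int64_t)value;
    return 0;
}
```

With this, "1000000000000" parses successfully, while a 20-digit string such as "99999999999999999999" is rejected as out of range.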

Hi @subhasisb,
Thanks for the reply.
Yes, we will change the database column type to “BIGINT”. We will also handle the cases you mentioned, such as upgrades.

Thanks Varun.

I think we do not need to validate the implementation here. The only real question I saw in this discussion was whether we need to support 1 trillion jobs for a 32 bit build as well, and I believe I heard the answer as “yes”. I myself believe we need to support it for the case of Windows, where we currently build PBS in 32 bit mode.

As far as implementation of how to handle (hold) a 64 bit value properly, that should be adequately reviewed during code review.

Given the above, I sign off on the design.

Hi All,
I have updated the EDD to mention the limitation on the length of “job name”, which is affected by this implementation. Please review the updated EDD and provide comments or sign-off.

Hi @mkaro,
As per our discussion we will go with the “unsigned int64_t” approach.
Please have a look at the EDD and provide your sign-off or comments.

@varunsonkar: Please use uint64_t as opposed to “unsigned int64_t”. Otherwise, that sounds fine.

Thanks @mkaro,
We will use “uint64_t” while implementing.

EDD looks good to me. I sign off.

At the PBS Pro User Group meeting I got face to face feedback of a strong desire to not have to deal with job ids of longer than ~7 digits. Users and admins regularly have to speak the job ID numbers out loud and it is bothersome when they get long. The customer in question has flipped the job ID counter multiple times in the past few years and basically wants the ability to preserve existing behavior. Can we add the max_sequence_id interface from v1 of the EDD back in (updated with the direction of the discussion around uint64_t)?

That looks like a good interface that meets the customer need. That said, I don’t see a discussion here of why it was dropped. Was there a specific reason?

Hi folks. Have there been any further developments or discussions on this issue? For sites that don’t want to use larger than 7 digits, nothing would stop them from flipping the counter back when they desire. For those who require growth past 9999999 jobs, seeing this issue revived would be quite helpful.

We are planning to start the implementation for this requirement.
We see that there are some suggestions for max_sequence_id, which was mentioned in v1 of the EDD.
So let’s start the discussion and finalize the design.
@scc,
could you please provide more feedback on the max_sequence_id requirement, if any?

Hello @Bhagat, rather than something like an explicit max_sequence_id to arbitrarily set a max, I believe it would be acceptable for the behavior to be configurable in a binary way: either limit to the current 7 digits max or allow up to 1 trillion (at least).

This should reduce testing burden but still provide the behavior desired by some sites (who regularly speak job ID number out loud and don’t want them to get any longer). We’d still have to consider what happens if the new behavior is in use, the sequence number is greater than 10 million, and the switch is flipped to use the 10 million limit. I’d propose that the next job ID issued after that be 0.
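To make the proposal concrete, here is a minimal sketch of the binary switch. The names next_seq_id, SEQ_LIMIT_SHORT and SEQ_LIMIT_LONG are hypothetical, and the 12-digit extended limit is an assumption chosen only for illustration; the actual limit is still under discussion.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_LIMIT_SHORT UINT64_C(9999999)       /* current 7-digit maximum */
#define SEQ_LIMIT_LONG  UINT64_C(999999999999)  /* assumed 12-digit extended maximum */

/* Return the sequence number to issue after `current`.
 * If `current` is at or beyond the active limit (including the case where
 * the switch was flipped back to the short limit while the counter was
 * already larger), the counter wraps to 0. */
uint64_t next_seq_id(uint64_t current, bool extended)
{
    uint64_t limit = extended ? SEQ_LIMIT_LONG : SEQ_LIMIT_SHORT;

    if (current >= limit)
        return 0;
    return current + 1;
}
```

For example, with the extended range enabled, 9999999 is followed by 10000000. If the counter stands at 123456789012 and the switch is flipped back to the short limit, the next ID issued is 0.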

The formatting change proposed in the EDD is right and necessary, but may need adjustment if the actual limit goes beyond 12 digits.

Hi @scc, all, a few thoughts:

  1. I feel the older interface does just fine. I think the binary switch might be just too limiting. For sites who might fit their jobs in a slightly longer jobid format, but do not necessarily want to go all the way till max, they would have the option to specify the number of digits (with the older interface). The amount of testing is almost the same.

  2. The requirement does not seem to specify that we absolutely need numbers, or sequentially incrementing ones. So, a question to the community: do users really care about sequentially incrementing numbers as job-id? If not, there could be several other ways to encode a trillion ids into a smaller set of digits. We could, for example, consider characters a-z as part of the character set to use in a job id (in the id part). We could generate 36^9 unique ids with 9 char-wide ids (26 letters + 10 digits).
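As an illustration of point 2 only, a base-36 encoding of a 64-bit sequence number could be sketched as follows. The function name to_base36 and the choice of lowercase letters are assumptions, not anything taken from the EDD.

```c
#include <stddef.h>
#include <stdint.h>

/* Write `value` in base 36 (digits 0-9, then a-z) into buf as a
 * NUL-terminated string. Returns the number of characters written,
 * or 0 if buf is too small to hold them plus the terminator. */
size_t to_base36(uint64_t value, char *buf, size_t buflen)
{
    static const char alphabet[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    char tmp[13];  /* UINT64_MAX needs 13 base-36 characters */
    size_t n = 0;

    /* produce digits least significant first */
    do {
        tmp[n++] = alphabet[value % 36];
        value /= 36;
    } while (value != 0);

    if (buflen < n + 1)
        return 0;

    /* reverse into the caller's buffer */
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```

The largest 9-character id, "zzzzzzzzz", corresponds to 36^9 - 1 = 101559956668415, and one trillion fits in 8 characters.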

With regard to point 2, I vote for keeping the job id numeric. A bunch of our scripts will break if we start getting letters.

And, as mkaro said, otherwise, you can run afoul (no pun intended) of vulgar words appearing in your job ids.

Current value of job ID field width (PBS_MAXSEQNUM): 7 characters

Representing 40 bits (numbers slightly exceeding 1 trillion) in various bases:
Decimal: 13
Hexadecimal: 10 (plus 2 if you prefix with “0x”)
Trigesimal (base 30): 9 (think of this as all digits and lower case consonants)
Hexatrigesimal (base 36): 8 (all digits and all letters)
base 96: 7
base 128: 6
base 256: 5
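The widths in this list can be checked with a few lines of C. This is only a sketch; the helper name digits_in_base is hypothetical, and it assumes base is at least 2.

```c
#include <stdint.h>

/* Number of digits needed to write `value` in the given base (base >= 2). */
unsigned digits_in_base(uint64_t value, unsigned base)
{
    unsigned digits = 1;

    while (value >= base) {
        value /= base;
        digits++;
    }
    return digits;
}
```

Applied to the largest 40-bit value, (UINT64_C(1) << 40) - 1, it gives 13 digits in decimal, 10 in hexadecimal, 9 in base 30, 8 in base 36, 7 in base 96, 6 in base 128 and 5 in base 256.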

One may conclude that if we want to maintain the current field width we would have to use base 128 with a one character prefix or base 256 with a two character prefix. Even with base 128, good luck finding 128 printable ASCII characters that might work.

I think the real question is what the new setting for PBS_MAXSEQNUM will be. I think we should increase it to 13 and continue to use decimal representation. We’ll need to make appropriate adjustments to qstat, the PBS Pro database schema, and probably other areas. I also think the time has come to ignore the 80 character line width limit. It’s been a long time since I used a WYSE or VT-100 terminal. I think the new “sane” limit should be 100 characters per line.

Increasing the width is no problem; it was done before for another branch, so it should just work. In that case:

  1. We can keep the default as current (7 digits).
  2. Allow PBS_MAXSEQNUM to be extended as required by a site (to the allowed max)

I was going a bit beyond and trying to figure out whether we need to absolutely have numbers, and even sequential numbers? Thanks @dtalcott for your opinion. I will add a bit more to my question (assuming we need numbers):

a. Do these numbers have to be incremental? I.e. if we get jobid 100 first and then get jobid 2 after that, would that be a problem for users? (The reason I ask is that as we go towards multi-server, it might be easier and less contentious, lock-wise, to dish out separate blocks of job-ids per server) - so do we care?
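To show what per-server blocks would mean in practice, here is a rough sketch. All names (id_pool, id_block, server_next_id, ID_BLOCK_SIZE) and the block size of 1000 are hypothetical. In a real multi-server setup the shared pool would presumably live in the database rather than in memory.

```c
#include <stdint.h>

#define ID_BLOCK_SIZE 1000u  /* hypothetical number of ids handed out per block */

/* Shared state: where the next unassigned block starts. */
typedef struct {
    uint64_t next_block_start;
} id_pool;

/* Per-server state: the half-open range [next, end) the server still owns.
 * A zero-initialized block is empty and triggers a refill on first use. */
typedef struct {
    uint64_t next;
    uint64_t end;
} id_block;

/* Take a fresh block from the shared pool. This is the only step that
 * would need a lock or a database round trip. */
static void refill(id_pool *pool, id_block *blk)
{
    blk->next = pool->next_block_start;
    blk->end = blk->next + ID_BLOCK_SIZE;
    pool->next_block_start = blk->end;
}

/* Issue the next job id for one server, refilling its block when empty. */
uint64_t server_next_id(id_pool *pool, id_block *blk)
{
    if (blk->next == blk->end)
        refill(pool, blk);
    return blk->next++;
}
```

With two servers sharing one pool, the first server issues 0, the second issues 1000, and the first then issues 1. Ids stay unique but are no longer globally increasing, which is exactly the behavior the question is asking about.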