Jobid namespace resolution for multi-server

@nithinj Thanks! You are correct: the main purpose of this feature is to ensure that job ids don’t collide, and the client-side stuff is just an optimization. I’ve reworded the design and added an FAQ to clarify that the jobid won’t necessarily allow a client to locate a job’s server. Please let me know what you think.

@bhroam Good question, I hadn’t thought about that. Instead of having admins bump up the max job id number, what if we handle it internally so that the 2 extra digits aren’t counted towards the limit? So, jobids will appear longer than max_sequence_id, but we could document that for multi-server it will be minus the 2 LSB. Or do you think it might be better to stick to the limit and the admins can bump up the max_sequence_id if needed?


I am also coming into this late, so apologies for that. Here are my thoughts on this:

  • IDs should be opaque strings. No information should be encoded in them. They should have one, and only one, function: to provide a unique handle for referencing the job. This is considered pretty standard best practice at this point. Having worked with systems that violated this principle, I can assure you it was very painful.

  • Opportunities to make these kinds of significant changes are few and far between. We should make a choice that gives us maximum flexibility. It should not in any way make assumptions about or limit how many servers are running, the maximum number you can run, adding or removing servers, etc…

  • Users will likely complain if we go to something other than strictly increasing integers, but it is a human tendency to react negatively to change. I do not believe it will inherently impact the ability of jobs to be run. It might break some user scripts that didn’t treat it as an opaque string, and it might take some getting used to, but I see no real impact on production.

  • I believe someone (BillN?) suggested having a configurable option to point at a service that can give strictly increasing integer IDs. Having that as an option is a nice compromise for sites that really want that feature and can accept the failure and performance consequences of that (which is probably most sites).

  • I would suggest something along the lines of <opaque hex string>-<counter> for each server. When the server comes up it generates a unique ID (the <opaque hex string>) and then starts an internal counter local to that server. Ideally, the counter would be zero padded so the ID is a consistent size. Generating unique IDs can be expensive, so this way that hit is only incurred at startup and thereafter, you are just incrementing a counter. It is fast, unique, scalable, and makes no assumptions about the state of anything else in the cluster. There could be foo-10 and bar-10 at the same time if you were running in multi-server mode, but I don’t think that is terribly confusing. You can also have two integer IDs that end in 10. This algorithm also has the advantage that if all your jobs come from a single server (or there is only one), you basically do get strictly increasing integer ids after the dash, with the caveat that if you restart the server, the integer part is going to start over. We could add an option to tell it where to start counting from. For clarity, here is an example:

  • server starts up and generates foo as its unique id
  • The first job comes in and gets foo-00000001 as an id
  • The 2nd job comes in and gets foo-00000002 as an id
  • The server process restarts and generates bar as its unique id
  • The next job that comes in has bar-00000001 as an id
  • The next job that comes in has bar-00000002 as an id
  • The server is restarted again, the id is baz, but the admin passes an option and tells it to start at 100
  • The next job has baz-00000100 as an id
  • The next job has baz-00000101 as an id

The above example had this happening sequentially, but that could also have been three different server instances started simultaneously. If you ever stayed up long enough that you got to 99999999 on the counter, then you generate a new unique id and start again at the beginning of the sequence.
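The scheme described above can be sketched roughly as follows. This is only an illustration, not PBS code: the class name `JobIdGenerator`, the `new_prefix` hook, and the use of a random hex string as the server's unique id are all assumptions. A random prefix is also not guaranteed to be unique for all time; a real implementation would need a stronger source of uniqueness.

```python
import secrets


class JobIdGenerator:
    """Hypothetical per-server generator of '<prefix>-<counter>' job IDs."""

    def __init__(self, start=1, width=8, new_prefix=None):
        self.width = width
        self.max_count = 10 ** width - 1  # 99999999 for width 8
        # new_prefix is an assumed hook; the default is a random hex string.
        self._new_prefix = new_prefix or (lambda: secrets.token_hex(4))
        self.prefix = self._new_prefix()  # generated once per server start
        self._next = start

    def next_id(self):
        if self._next > self.max_count:
            # Counter exhausted: take a fresh prefix and restart the sequence.
            self.prefix = self._new_prefix()
            self._next = 1
        job_id = f"{self.prefix}-{self._next:0{self.width}d}"
        self._next += 1
        return job_id
```

Creating a new `JobIdGenerator` corresponds to a server restart, and `start=100` corresponds to the admin option in the example. The expensive step (generating the prefix) happens only at startup and at counter rollover.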


Hi Bill, thanks for chiming in and taking the time to express your views in such detail. Here are my thoughts:

I see where you are coming from, but PBS job ids have always had the server information encoded in them: <sequence number>.<server name>. All client commands use this to find the correct server when a site has multiple PBS complexes configured. The main purpose of the encoded multi-server ‘id’ in the proposed algorithm is indeed to generate unique jobids. It’s just that IFL could also leverage it to make a good guess as to where the job resides, very similar to what it does today with the server name part. The idea is not to guarantee job location via the id, but to allow the code to not have to broadcast each job request to multiple servers whenever possible. In case the job is not found on the expected server, IFL will broadcast it to all servers. These are just internals of how we might use the id. Admins will not be advised to make any assumptions about it. I can remove the mention of what exactly PBS will do with the reserved bits from the doc to ensure that. Does that address your concern?
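To make the "guess first, broadcast on a miss" idea concrete, here is a minimal sketch. It is not IFL code: `locate_job`, `guess_server`, and the dictionary standing in for real servers are hypothetical names used only for illustration.

```python
def locate_job(job_id, servers, guess_server):
    """Return the name of the server holding job_id, or None.

    servers: dict mapping server name -> set of job IDs it holds
             (a stand-in for querying a real server).
    guess_server: hypothetical function mapping a job ID to the name of
                  the server expected to hold it.
    """
    expected = guess_server(job_id)
    if expected in servers and job_id in servers[expected]:
        return expected  # the guess was right, no broadcast needed
    # The guess was wrong (for example the job moved): ask everyone else.
    for name, jobs in servers.items():
        if name != expected and job_id in jobs:
            return name
    return None
```

The id only decides which server is asked first. Correctness never depends on the guess, which is why admins would not need to make any assumptions about it.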

We do need to fix the max length of the job id, which does restrict how many bits we can reserve for the server id/hash string, and the max number of servers that we can have. @subhasisb felt that 1 character might be good enough (10 servers max). @billnitzberg suggested reserving 2 (100 servers max) and that’s what I went with. We can always increase it if desired. It will actually not be a big change IMO: we recently bumped up the max limit of a jobid’s sequence number, and bumping up the characters reserved for the server id should be similar. Until then, we can choose an initial max which makes sense without bloating up the jobid too much.

I think the algorithm that we were thinking of is actually similar to this: <counter><opaque decimal string>.<cluster name>. I’m guessing we’ll need <cluster name> regardless of what we choose (and if not then that’s a separate discussion than this). So the main difference is a hex string vs a decimal string, does that really matter? Keeping it a decimal makes the jobids look similar to how they do today, which might be nice since there are probably a bunch of user scripts/front end tools which parse job ids which won’t need any change to accommodate this.
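For illustration, this is how the `<counter><opaque decimal string>.<cluster name>` layout could be taken apart internally. The function name `split_job_id` and the default of two reserved characters are assumptions based on this thread, not a documented PBS format, and user scripts would still be expected to treat the id as opaque.

```python
def split_job_id(job_id, server_digits=2):
    """Split '<counter><server digits>.<cluster name>' into its parts.

    server_digits=2 reflects the two reserved characters discussed in this
    thread; it is an assumption here, not a documented PBS format.
    """
    # Split on the first dot only, since a cluster name may itself contain dots.
    local, _, cluster = job_id.partition(".")
    cut = len(local) - server_digits
    counter, server = local[:cut], local[cut:]
    return counter, server, cluster
```

A jobid such as `123407.clusterA` still looks like a plain number followed by a name, which is the backward-compatibility argument for keeping the reserved part decimal.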

Please let me know what you think. This is still very much in design phase so we can go in a different direction if needed.

I think Bill makes a great point with:

Multiserver brings a huge advantage in speed & reliability – I think folks will be willing to trade some amount of change/pain for that advantage… as long as we support backward compatibility for a reasonable configuration (e.g., the single server case for multi-server or a plug-in that allows the site to generate backward compatible IDs if they want).

In particular, we should revisit (a) using increasing numbers, (b) using only numbers, (c) requiring the cluster name, (d) having a (too small) fixed size ID, etc.

Perhaps we could do something like how Docker generates unique IDs (or some combination as Bill suggests, so expensive “unique” IDs don’t have to be generated too often)?

If we stick with only numbers, even if we eliminate the sequencing, we’ll have lost an opportunity that doesn’t come along very often.

Just saying…

Ok, I see your point and totally agree that we have an opportunity to do whatever will be the most useful in the long term. But I do think that we should do something disruptive and possibly less human friendly only if it adds considerable value. I can’t really think of an approach which would have a lot more merit than the simpler solution, but maybe I have tunnel vision. I’m not convinced that using hexes over numerics adds much value. Other ideas/thoughts?

First, it is actually worse than you think. I said hex and I really meant alphanumeric. I believe there is considerable merit in not limiting ourselves to digits, but clearly this is arguable either way. The reason I don’t want to limit it to digits is the representation space. If you limit the number of “digits” identifying a server to 2 (using your proposal as an example) you can represent 100 servers with digits only, 256 with hex, 36^2 or 1296 if you use digits and one case, and 62^2 or 3844 if you use digits, uppercase and lowercase.
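The arithmetic is easy to check. The sketch below just counts two-character identifiers over each alphabet; the names `ALPHABETS`, `id_space`, and `SIZES` are made up for the example.

```python
import string

ALPHABETS = {
    "digits": string.digits,
    "hex": string.digits + "abcdef",
    "digits + one case": string.digits + string.ascii_lowercase,
    "digits + both cases": string.digits + string.ascii_letters,
}


def id_space(alphabet, length=2):
    """Number of distinct IDs of the given length over the alphabet."""
    return len(set(alphabet)) ** length


# Two-character server identifiers: 100, 256, 1296 and 3844 respectively.
SIZES = {name: id_space(chars) for name, chars in ALPHABETS.items()}
```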

I think where we are differing is in our assessment of:

  • the number of servers we want to plan for
  • the perceived impact of moving away from just digits
  • perhaps the time horizon we are designing for

Time horizon: I figure you get to make a change like this once every 20 years or so, so I am trying to choose something that will work for 20 years. Someone already mentioned Bill Gates and the 640KB of RAM scenario. How painful was that? And… maybe I am showing my age :slight_smile:

Number of servers: I envision a scenario where servers are dynamic. They come and go. If you take the scenario to its logical extreme, there could be a server for every job. That certainly means there would be more than 100. Maybe more than 100 running simultaneously. Maybe that will happen, maybe it won’t, but it certainly could (BTW, the Flux scheduler folks at LLNL have the same idea). I just want to make sure we don’t limit ourselves in that respect. If you are willing to designate a 64 bit number to represent the number of servers, then I am fine with it being numeric.

Oh, something else just occurred to me that probably was not obvious. I want this to be unique for all time. Every time a server starts, it gets a new ID and that ID can never have been used before in the history of that facility. If you use one of the standard UUID algorithms it should be globally unique, but I don’t think that is necessarily required.

I will acknowledge that how we generate that unique ID will be a discussion in and of itself. There are a LOT of alternatives all with pros and cons.

The impact of going to alphanumeric: Yes, some scripts might break. However, since there are no guarantees about being consecutive, other users can cut in, etc., doing math on it really doesn’t make much sense, so it should be, and probably is, being treated as a string. Other than “looking weird” to humans, the biggest impact would probably be in databases. If they are storing this data in a database, they might have it stored as an int, or if it is a string, they might have to make the max string length longer. Asking them to do that once every 20 years to get a massive scalability boost seems reasonable to me, but certainly that point can be argued. Oh, and we could choose to do what git does and only display n digits as long as it is unique to cut down on the “looking weird” aspect.
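The git-style display idea could look something like this. It is a sketch of the general technique only (it is not how git itself implements abbreviation), and `shortest_unique_prefixes` and `min_len` are hypothetical names.

```python
def shortest_unique_prefixes(ids, min_len=4):
    """Map each ID to its shortest prefix that no other ID shares."""
    result = {}
    for job_id in ids:
        others = [other for other in ids if other != job_id]
        result[job_id] = job_id  # fall back to the full ID
        for n in range(min_len, len(job_id)):
            prefix = job_id[:n]
            if not any(other.startswith(prefix) for other in others):
                result[job_id] = prefix
                break
    return result
```

Only the display is shortened. The full id remains what is stored and what commands accept, so a short form that becomes ambiguous later simply has to be typed with more characters.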

Using the server name in the ID as a heuristic for contacting the right server was mentioned above. I feel a little bit dirty saying this, since I am breaking my own rule about not encoding data in an ID, but the alphanumeric portion (in my proposal) basically is a unique identifier for the server. When it comes up and generates that unique string, it could broadcast that to the other servers and you now have a lookup table that you could use for that heuristic. Basically a routing table.
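A minimal sketch of that lookup table, assuming the `<server id>-<counter>` layout from my proposal. `RoutingTable`, `announce`, and `lookup` are hypothetical names, and the address strings are placeholders.

```python
class RoutingTable:
    """Hypothetical map from a server's unique ID prefix to its address."""

    def __init__(self):
        self._routes = {}

    def announce(self, server_id, address):
        # Called when a server starts and broadcasts its freshly generated ID.
        self._routes[server_id] = address

    def lookup(self, job_id, separator="-"):
        # The part before the separator identifies the originating server.
        server_id, _, _ = job_id.partition(separator)
        return self._routes.get(server_id)  # None means: fall back to broadcast
```

A miss in the table is not an error. It just means the client falls back to asking all servers, so the table stays a heuristic rather than a guarantee.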

I scanned back through the thread and I think I responded to all the points that were brought up. I may or may not have convinced you of course :slight_smile: If I missed something or something is not clear, please let me know.

Thanks for providing more details Bill!

Ok, it sounds like there is at least one clear benefit of using alphanumerics over digits: being able to pack more servers into the same number of characters. I think we could actually take it one step further and say that this can be any of the 127 ASCII characters other than a dot, because we use that as a delimiter for the cluster name; then we could fit 127^2 or 16k+ servers in 2 characters.

But, we were debating whether a site will ever need more than 10 PBS servers not long back :slight_smile: So, I hope you can understand why I’m a bit hesitant to see that as an important differentiator. I might again be suffering from tunnel vision, but the scenario where we have dynamic servers coming and going, or we have 1 server for each job, doesn’t really sound like these are PBS servers, at least nothing resembling the PBS server as it exists today. I totally see your point that we should not close the door on such possibilities, but I also feel that if we ever go in directions like these, we would be changing so much of PBS as it exists today that we would probably need to take up jobid redesign afresh. It might actually not be useful to add a server id to jobids at all if servers come and go, or if there’s a new server spun up for each job. But maybe I’m wrong, so I’d really like others to chime in and tell me if I’m being too short-sighted. But I totally agree with you that if it’s indeed important to design for thousands of servers then using alphanumeric would make a lot more sense.

Thanks for bringing this up, I hadn’t thought about it. I think we’ll need to reserve more than just 2 characters if we want to fulfill this requirement (in any approach). I do want to point out that PBS doesn’t support this today, jobids go back to 0 after a point, and I think the simplistic solution will behave the same, server restarts won’t cause jobids to be duplicated, but they’ll go back to 0 after the max. Maybe we can focus on the “number of servers” topic first? I think this might deserve a separate discussion.

Could you provide a use-case or two for these? These are radical changes and should provide a compelling, real-life benefit.

Happy New Year everyone. I hope everyone had a wonderful and restful holiday.

There are a number of open discussion points, so I am going to start with the big picture and then drill down.

First, I want to stress that we have all agreed that we should, and can, make this change invisible in a single server environment. If you have no need of the scalability gains multi-server brings you, which at least in the near term will be most people, then this is not an issue. If you do need the scalability, you should expect some disruption.

I believe the basic difference is one of focus. I am looking 20 years down the road. I am looking beyond Exascale. I am looking at billions of schedulable entities and 100s of millions of jobs. I am trying to fix this problem once and for all, and yes that will likely be more disruptive. Given that these changes only affect multi-server which is all about scalability, scalability is where I think the focus should be. My interpretation of the objections that have been raised is that they focus on minimizing disruption now, at the expense of multiple disruptions in the future.

I was asked to provide use cases that justify unique identifiers. I have already given some, but I will consolidate and add some more.

  • The most basic is that jobIDs are supposed to be unique identifiers. If you ever roll it over and reuse an ID you are almost certainly going to break the facility’s database, since they likely will have marked the JobID as unique, a primary key, or both. PBS has already had to extend the length of the jobID, presumably for that reason. If we do this now, we should never have to change the jobID again.

  • Right now, at least on machines of the scale that we run, HTC jobs are problematic. The most common solution to that problem is to use some sort of tool that runs itself as a single job as far as the scheduler is concerned, but then it runs many smaller jobs on those nodes. That is sub-optimal in a number of ways, not the least of which is that it prevents the scheduler from optimizing the workload since it can’t see a significant fraction of the workload. This is one of the reasons I foresee many servers with relatively short lifetimes. They would fulfill the same function. If the “master scheduler” saw that x nodes would be available for y minutes because it was draining for a larger job, it could start a “sub-scheduler” that only saw those x nodes and could run jobs on it. This distributes the load dynamically. You can also optimize the scheduling algorithm for each server. This is not unlike the two level scheduling that Mesos does. The “master scheduler” only cares about provisioning resources, making sure the nodes are used, it then delegates the running of jobs on those resources to other schedulers.

  • If we go all the way and make the ID globally unique, it opens the door to some options. In theory, I can pass this job off to any other PBS server… anywhere in the world. For instance, we work closely with the Advanced Photon Source (APS - for the purposes of this discussion, call it a giant Xray machine). The APS generates data. They compute on some of it, some of it they send to us. We currently are, and I suspect always will be, in separate administrative and security domains, but in theory, they could send a job over to us and there would be no issues with the ID. There are lots of other issues to solve to make something like that happen, but at least the IDs are not part of the problem. Also, if you use the routing table idea I mention below, it is also trivial to track where the job originated and get results back there. I don’t see that happening today, but it could.

Now I will address some more specific items:

  • Ravi suggested that we might not need to add server IDs to the jobID. I think we should and I believe we can actually further optimize the communication lookup. You could create the equivalent of a routing table. You record the server portion as the contact for that jobID. If it migrates, you update the routing table. This means the only time you should ever have to broadcast is if you queried during the window when a job was migrating, which hopefully should be rare. This also works in the, admittedly future looking, situation where you routed it to a different PBS instance.

  • Ravi said what I was envisioning would require a major rewrite and we could fix it then. Yes, I believe PBS is going to change A LOT to meet the future needs of HPC. That being said, there are advantages to fixing things a bit at a time. If we change the ID now, when it comes to making other changes, we have one less thing we have to worry about.

  • I don’t feel strongly about this, but thought I would explain why I suggested Server ID-JobID. There are two aspects to this: Where the JobID is in the string and what the separator is.
    - I would put the JobID last because most of us expect the right most digits to change the fastest when things count up, and thus would be more comfortable for the users. See, sometimes I do advocate for the users :blush:.
    - I suggested a dash, particularly if they are both integers so that it is clear it is a separator and not a decimal point. However, it was pointed out that many UUID schemes use dashes and thus having a dash as a separator would make parsing difficult. I guess the takeaway is that the separator, whatever it is, should not be a valid character in the ServerID or JobID.
    - Since you have history with JobID.ServerID there is an argument for leaving it that way; I just wanted to explain why I suggested what I did.

  • While this is minor, I wanted to point out that you would not want to use all of ASCII for the characters in the ID. Printable minus whitespace and the separator would be my suggestion. That should be about 90 options for each character in the ID. If you use all of ASCII, bizarre things will happen: if you get ASCII 7 in the server name and then do a qstat and get 1000 lines, your terminal would beep 1000 times since ASCII 7 is the bell. Tabs, carriage return, form feed, line feed, vertical tab, etc. would also have undesirable effects, not to mention having an ASCII 0 NULL in a C string… If we use a standard UUID scheme, this discussion will likely be moot since the character set will likely be specified.
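On the character set point in the last item, here is a small sketch of "printable minus whitespace and the separator". The choice of `.` as `SEPARATOR` is an assumption carried over from today's `<sequence number>.<server name>` ids, and the names `ID_ALPHABET` and `is_valid_id_part` are invented for the example.

```python
import string

SEPARATOR = "."  # assumed separator, as in today's <sequence>.<server> IDs

# Printable ASCII without whitespace: letters, digits and punctuation.
PRINTABLE_NO_SPACE = string.ascii_letters + string.digits + string.punctuation
ID_ALPHABET = PRINTABLE_NO_SPACE.replace(SEPARATOR, "")


def is_valid_id_part(text):
    """True if text is non-empty and uses only characters in ID_ALPHABET."""
    return bool(text) and all(ch in ID_ALPHABET for ch in text)
```

With these definitions `ID_ALPHABET` has 93 characters, consistent with the "about 90" estimate, and control characters such as the bell (ASCII 7) or NUL are rejected.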

So now, let me push back on the reasons for not doing this:

  • Space: That is not a legitimate argument as far as I am concerned, particularly given the potential advantages. The main reason I say that is that the needs scale with the scheduling load. If you only run a single server and say 100K jobs in a year, that is less than an extra MB of storage. If you are running workloads like I am planning for, you will have the infrastructure to support it. The other reason is simply that storage is cheap.

  • The other reason is change: I agree that the argument here is more subjective and thus less clear cut. But here are my thoughts in that regard:

    • Change is required for multi-server. If we have to change something, let’s fix it once and for all.
    • It is going to break more scripts - Maybe. But unless you have a good statistical sampling of all the hooks, ETLs, and individual user scripts out there, this is a matter of conjecture on both our parts. If they are doing what they should be, which is treating it as an opaque string, it shouldn’t break anything. If they put a length limit somewhere, they will have to change that, but they will probably have to do that either way.
    • Users won’t like it - This is separate from scripts breaking; this is just being uncomfortable with change. Maybe I am not very nice, but I just don’t see that as being a strong reason. “JobIDs have always been numbers” just does not outweigh the potential advantages in my mind.

Summary: Please, let’s fix this once and for all and leave the door open for opportunities and innovation, not kick the can down the road by doing what we have always done.

Happy new year to everybody. What a great discussion!! Let’s keep it going :slight_smile:

I completely agree with these:

  • We need to keep the design flexible to scale to the next set of decades, sure thing!
  • Alphanumeric is great (shorter ids with larger id space)
  • Not limiting number of servers to just a few

I feel we can summarize to these two:

  1. The jobid needs to be a “globally” unique identifier for the job, made up of alphanumerics
  2. The jobid needs to be a completely “opaque” string (nothing in the jobid can be used to derive information about which server instance it originates from etc)

If we agree, we can simply allow our ids to roll over to alphabet space (from being just numeric) and retain the jobid.cluster-id semantics.

One issue in doing this is maintaining 2 code paths: one that recognizes that PBS is running in single-server mode and needs to dish out backward compatible jobids, and another for multi-server mode that dishes out something different. We would much prefer to have a single code path (to ease maintenance and testing etc)…

So, either we:
a) Make the change for both single and multi-server mode, or…
b) We agree to (and document) an alphanumeric jobid space, but for now and till we need it, dish out only numeric ids…

Thanks for the inputs guys and providing more details.

It’s great that we are thinking about the future of PBS in general, but I’m not confident that we can today design for the future without knowing more details about it. For example:

PBS doesn’t do this today, so that’s a new feature which I think deserves a dedicated discussion. Today, the assumption is that when the jobid rolls over back to 0, job 0 is either long gone or nobody cares about it anymore, or both, so it’s ok to roll back. If that assumption is no longer true, then it should be addressed regardless of multi-server. Maybe it makes sense to have this be configurable so each site can decide if they care? So, I think it needs a dedicated discussion.

This is again something that deserves a dedicated discussion. There might be multiple ways of solving the HTC problem: maybe instead of a master-slave model, a peer to peer model where there’s one scheduler for each permanent multi-server instance, might work better. Or maybe a model with one server and multiple schedulers will work better where jobs are submitted as job arrays (or an evolved form of job arrays) so the server doesn’t get bombarded with a zillion individual jobs and the scalability is needed on the scheduler side instead, not the server. Again, I think we need more details and a dedicated discussion for it.

I think this actually is supported today, you can qmove a job from one PBS cluster to another PBS cluster, anywhere in the world. I believe that the way jobids are kept unique across the 2 clusters is via the ‘cluster name’ (or the ‘server name’ in a non multi-server world) of the jobid, which is unique for each cluster.

I’d like to propose this:

  • I’m going to change my design to say that in a multi-server world, jobids should not be expected to exhibit any particular pattern, so they may not be sequential and they may not be numerical, just treat them as opaque strings.
  • By being vague, we’ll teach users to not rely on the jobid being in any particular format and it will allow us to change it in the future to whatever is needed, when it’s needed.
  • So, if we decide a year from now that jobids need to be alphanumeric, we can do so without any issues, it will just be a code change, no change in external behavior.

So, this will keep the doors open to possibilities in the future while allowing me to solve the short term problem of jobid namespace conflict in a multi-server world. What do you guys think?

That sounds good to me. Let’s wait to hear from others

I feel we can summarize to these two:

  1. The jobid needs to be a “globally” unique identifier for the job, made up of alphanumerics
  2. The jobid needs to be a completely “opaque” string (nothing in the jobid can be used to derive information about which server instance it originates from etc)

If we agree, we can simply allow our ids to roll over to alphabet space (from being just numeric) and retain the jobid.cluster-id semantics.

I think the above makes sense and I think we are converging.

I understand the concern about maintaining two code paths, but I am not seeing why there needs to be two code paths? Likely, I am just missing something. I just noticed you are calling it a cluster-id and not a server-id, which in my head have been synonymous. As long as the serverID was unique, you could plug that into the cluster-ID portion and it should just work. Does the cluster need to be identified in the ID? The comm network needs to know what servers are in the cluster so they know who to talk to, but does that need to be in the ID? Here is what I am thinking:

  • The server/cluster ID is defined to be say a 36 byte string (length is negotiable but UUIDs are 36 characters including the hyphens, so that is why I picked 36):
    • It can be anything, as long as it is unique to the facility
    • If it is globally unique it enables some future functionality
  • Likewise, the JobID is defined to be a 32 byte string
    • again, it can be anything as long as it is unique when combined with the server/cluster ID

Given that, I think the algorithm you use today would work:

  • The jobid portion could be exactly what it is today, an incrementing number. It would be stored as a string, and the servers would be creating them independently so you could have 1.server1 and 1.server2, but if we allowed them to specify a starting number for the counter you could start server1 at 10M and server2 at 20M or something to minimize the potential of that happening. It would not matter to the code if the jobID portion are the same, it would just be to avoid confusion for the users.
  • I don’t know how the clusterID is set today, but as long as it isn’t longer than the length of the string we specify (and we could always hash it) and it can be unique for each server, I think that would also work.

What would break this is if there is some logic in the code that requires it to have some ID that represents the entire cluster, the entire group of servers. Is that the case?

We could then provide an alternative, or multiple alternatives, or even make the ID generation code pluggable so the user could put their own algorithm as long as they follow the rules: opaque string, of a specified length, unique to the level they care about. Maybe they only care about their own facility and not being globally unique. Maybe these multiple algorithms are what you meant by multiple code paths? If so, that seems fairly minor to me, but maybe you disagree or I am being naive and missing something?
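A rough sketch of what pluggable ID generation with enforced rules might look like. Everything here is hypothetical: the `IdGenerator` type, the `checked` wrapper, and `MAX_ID_LENGTH` (taken from the 36-character figure above). The uniqueness check is in-memory only and stands in for whatever persistent check a real server would need.

```python
import uuid
from typing import Callable

MAX_ID_LENGTH = 36  # assumed limit, taken from the 36-character figure above

IdGenerator = Callable[[], str]


def uuid_generator() -> str:
    return str(uuid.uuid4())  # 36 characters including hyphens


def make_counter_generator(start: int = 1) -> IdGenerator:
    """Today's behaviour: an incrementing number, stored as a string."""
    count = start - 1

    def generate() -> str:
        nonlocal count
        count += 1
        return str(count)

    return generate


def checked(generator: IdGenerator) -> IdGenerator:
    """Wrap a site-supplied generator and enforce the rules on its output."""
    seen = set()

    def generate() -> str:
        new_id = generator()
        if not isinstance(new_id, str) or not 0 < len(new_id) <= MAX_ID_LENGTH:
            raise ValueError(f"invalid ID: {new_id!r}")
        if new_id in seen:
            raise ValueError(f"duplicate ID: {new_id!r}")
        seen.add(new_id)
        return new_id

    return generate
```

For example, `checked(make_counter_generator(start=10_000_000))` corresponds to starting one server's counter at 10M, while `checked(uuid_generator)` would suit a site that wants global uniqueness. The rest of the code only ever sees a string.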


A few comments

  • Yes, there are many ways to address the HTC problem. That was an example to motivate the idea of many, many servers that might have dynamic lifetimes, which would also potentially have the same requirement. The master-worker configuration was not what was important; what mattered was the fact that a server was dynamically spawned to deal with a subset of the problem.
  • How do you guarantee that the cluster name is globally unique? Does it incorporate a URL or something?
    • answering my own question, it looks like it uses the fully qualified domain name. So I guess the question is, while you call it clusterID, is it really or just a serverID? Is there a need in the jobID to indicate they all came from the same “cluster”?
  • In general, I think your proposal is fine. One question though. Are you going to actually change how they are stored and manipulated to be strings in the code, or are you just going to put verbiage in the design document saying treating them as opaque is a best practice? I would much prefer that we move to the strings sooner rather than later.

AFAIK, before multi-server, a server id was equivalent to cluster id so it didn’t matter. With multi-server, we have a need to differentiate servers which belong to the same cluster vs those which belong to remote clusters. For servers which belong to the same cluster, the code will broadcast various requests when necessary and collate responses to get the “complete” picture of various sharded objects like jobs and nodes. PBS also has features like peer-scheduling which operate on multiple clusters, not the servers in the same cluster. Hence the need to have a cluster id.

For my purposes (avoid jobid conflicts in a multi-server scenario), I don’t need them to be alphanumeric. So, I am planning to just change the design and leave room for them to be converted to strings later, when needed. I think we need to discuss that separately and get opinions from other devs/users. I also think that making them alphanumeric shouldn’t be a multi-server feature, we are just using multi-server as a “switch” to turn this on or off. If we need a switch, we should add one explicitly for it. Anyways, as I said, there’s more to be discussed I think so we should take it up separately.

I’ve made some updates to the design (PBS Pro Confluence), please let me know what you guys think. Thanks!


A couple items related to this discussion.

  • There is a desire for a job to have an identifier that is unique through all time. I think the best way to do this is to create a new job attribute that is a UUID, and is distinct from the jobid. A future feature could implement this and allow jobs to be referenced either by jobid or by UUID. For the present, it suffices for the jobid to be unique among all jobs known to a cluster.

  • Currently, the job sequence number is also used to create reservation ids and corresponding queue names. We don’t want changes to the jobid mechanism to result in unwieldy queue names.
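As a sketch of the first item, a job could carry a UUID alongside its jobid and be found by either. The names `Job`, `job_uuid`, and `JobTable` are hypothetical and do not correspond to existing PBS attributes or code.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Job:
    jobid: str  # unique among jobs currently known to the cluster
    job_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


class JobTable:
    def __init__(self):
        self._by_jobid = {}
        self._by_uuid = {}

    def add(self, job):
        # A reused jobid replaces the old entry here, but not in the UUID index.
        self._by_jobid[job.jobid] = job
        self._by_uuid[job.job_uuid] = job

    def find(self, ref):
        # Accept either kind of reference, as the future feature might.
        return self._by_jobid.get(ref) or self._by_uuid.get(ref)
```

This keeps the jobid short (which also keeps derived reservation ids and queue names short) while the UUID carries the "unique through all time" property: if a jobid is reused after rollover, the older job can still be found by its UUID.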
