Adding a new member to batch_status structure

Yes, clients do not need to worry about this attribute, as it is read-only and its contents are completely opaque, i.e. clients do not need to interpret it. If required, we can suppress it in qstat -f output.

I disagree that it will be opaque to clients. If a client does a pbs_statnode and asks for only 1 attribute, it will still get 2 attributes back. So this might break existing end-user programs.

The performance penalty of looping over a handful of servers should be insignificant. I ran a test to see how long 5 million string compares take: about 30 milliseconds on my system. If the scheduler has 1 million jobs in the queue with 5 servers (5 million string compares in total), 30 milliseconds will not make a difference; the whole scheduling cycle will take orders of magnitude longer. Also, with diff-stat we should never send 1 million jobs to the scheduler, so the performance penalty should be a non-issue and we should do what makes the most sense and is the most robust.
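
For reference, a minimal sketch of this kind of micro-benchmark (the string contents and iteration count are placeholders; build with -O0 or -fno-builtin-strcmp so the compiler does not fold the compares away):

    /* Rough micro-benchmark: time N strcmp() calls, similar to matching a
     * job's owning server against a handful of configured server instances. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define NCOMPARES 5000000L    /* e.g. 1 million jobs x 5 servers */

    int main(void)
    {
        const char *a = "server-instance-03:15001";    /* placeholder values */
        const char *b = "server-instance-05:15001";
        volatile long matched = 0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < NCOMPARES; i++) {
            if (strcmp(a, b) == 0)
                matched++;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("%ld strcmp() calls took %.1f ms (matched=%ld)\n",
               NCOMPARES, ms, matched);
        return 0;
    }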

What I meant is that the content of the owning_server_info attribute is opaque, not the attribute itself. If a client queries a single specific attribute, it gets only the attribute it asked for; it does not get this new attribute in addition. This attribute is returned only when someone does a full job/node stat. Please see the examples I added earlier in this thread. Also, this attribute is read-only.

Thanks, Ravi. In the example you have taken, if we also include the string comparisons needed for nodes, the cost roughly doubles, i.e. approximately 60 ms. If saving 60 ms does not matter, then we can just do this computation inside the scheduler itself. However, it is not just the scheduler that has to compute this; all scheduler-like clients have to do the same computation.
I am open to both approaches, i.e. the IFL populating owning_server_info or scheduler-like clients deriving it, if the said performance gain does not matter. I would just like to hear back from everyone else so that I can settle on one approach and proceed.

The scheduler was also querying only a select few attributes a while back; it might do so again. You can't design around one client behaving a certain way today.

It will double only if we have 1 million nodes in the system, which is not realistic. But even if it's 60 ms, it won't matter. If we are transferring 1 million nodes and 1 million jobs to the scheduler, the query time itself will be tens of seconds, so tens of milliseconds is nothing. The same applies to any other client querying 1 million jobs/nodes.

@suresht
When making decisions for performance reasons, you need to test. Just because something looks like it'll be slow doesn't necessarily mean it will be. Querying 1M jobs and 50k nodes takes tens of seconds; adding 30-60 ms to that is insignificant. Also, with the persistent scheduler using diffstat, N drops to a much smaller number.

This did bring up a possible issue. When one of the N server’s connections goes down and comes back, the scheduler will need to reread the universe since the socket will change for that server’s jobs.

As a note, the scheduler does request just the attributes it needs. If you were planning on not reporting it in that case, the scheduler won’t get it. If we do go this route, I would definitely not report it in qstat -f or qstat -Fjson.

As a note, unless we document that the first attribute will always be the socket, the scheduler will need to handle it in its attribute loop. That loop goes over the returned attributes and does a strcmp() against each known attribute, which means we'd end up doing more string comparisons than the number of servers.
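
For reference, the loop being described looks roughly like this (a minimal sketch using the public struct batch_status / struct attrl types from pbs_ifl.h; the handled attribute names are just examples):

    #include <string.h>
    #include "pbs_ifl.h"    /* struct batch_status, struct attrl, ATTR_* names */

    /* Sketch of the scheduler-style attribute loop: for every returned
     * object, walk its attribute list and strcmp() each entry against the
     * attributes we know how to handle. */
    static void
    parse_attrs(struct batch_status *bs)
    {
        struct attrl *ap;

        for (ap = bs->attribs; ap != NULL; ap = ap->next) {
            if (strcmp(ap->name, ATTR_queue) == 0) {
                /* handle queue */
            } else if (strcmp(ap->name, ATTR_l) == 0) {
                /* handle Resource_List */
            } else if (strcmp(ap->name, "server_instance_id") == 0) {
                /* one more strcmp per attribute, per object */
            }
        }
    }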
Bhroam

I think the existing job attribute ATTR_server actually designates a CLUSTER. What we are interested in here is the server INSTANCE. I think (we could try this out) that ATTR_server is used in the case of qmove and the like (between clusters), so we need a different attribute anyway to tell clients that an object belongs to (or came from) a particular server instance!

In the multiserver architecture, the IFL layer is already “manipulating” data received from the server instances (take, for example, the “aggregate” job count it returns for a stat request), hence I do not see why it would be wrong to populate a similar attribute at the IFL layer.

I do feel that the proposed attribute's name is a bit confusing. Instead of “owning_server_info” I would suggest we name it something simpler, say “server_instance_id” (the key point here being the difference between a server and a server instance). In PBS the current word “server” means “cluster”, so we need that differentiation!

@subhasisb, thanks very much for the feedback. I will rename the attribute from owning_server_info to “server_instance_id”.

Thanks @bhroam. As we discussed, performance is not the only reason. We need a new attribute to represent the server instance anyway. Please see Subhasis's reply for more details on this.

When one of the servers goes down, the scheduler closes all server connections since it internally calls pbs_disconnect; that is pbs_disconnect's behavior as of today. So we need to reread the universe if such a thing happens. Since a server going down is a very rare event, I think this would be okay.

When a new attribute is added and the scheduler is interested in it, the scheduler has to request this attribute as usual. But we can always make the new attribute the first one among the requested attributes, which lets us avoid the loop, as sketched below. As you said, we can add a note to our design document.
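
As a sketch (with the caveat that the server preserving the requested order is exactly what we would have to document), the scheduler could put the new attribute at the head of the struct attrl list it passes to the stat call:

    #include <stdlib.h>
    #include <string.h>
    #include "pbs_ifl.h"    /* struct attrl */

    /* Prepend server_instance_id to an existing request list so that,
     * if the documented ordering holds, it comes back first and can be
     * read without walking the whole attribute loop. */
    static struct attrl *
    prepend_instance_attr(struct attrl *req_list)
    {
        struct attrl *a = calloc(1, sizeof(struct attrl));

        if (a == NULL)
            return req_list;    /* fall back to the original list */
        a->name = strdup("server_instance_id");
        a->next = req_list;
        return a;    /* new head of the request list */
    }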

Hi,

As long as the client gets the data it requested, and only that, it seems like the two approaches (the server sending an id to the client vs. the client-side IFL adding an id) will behave the same way functionally. I prefer the server sending it over the IFL doing it simply because it seems like a more maintainable solution. If tomorrow we want this attribute to contain anything other than an fd, the IFL will not be capable of populating it. Also, why do something special if it doesn't really benefit us much? Use the standard way if it works.

Just adding one more point in favor of the current proposal which I forgot to mention earlier.

While both approaches behave exactly the same functionally, having the socket fd as the value of server_instance_id, rather than the server instance name, has the following advantage.

If we store the server instance name, then after querying this attribute for every job and node, the scheduler has to walk the list of configured servers to find the socket fd corresponding to that server instance name. The same calculation is needed in every other scheduler-like client; in other words, the code for this calculation is duplicated across all scheduler-like clients.

Instead, if server_instance_id holds the socket fd, then scheduler-like clients get it directly just by calling the node/job stat IFL, with no need to repeat this calculation. This way we have one central place for the calculation, the IFL layer, so in that sense it is also abstracted, and code maintenance is easier.

The other benefit is that since server_instance_id is opaque, we can easily change its content to something else in the future based on need.

We can just create a common function that any client can call to get the server fd; why would there be any code duplication?

I'm not arguing against that; call it whatever you like. Since it's a generic name, it makes all the more sense to abstract it out and create an interface that accepts the attribute as input and returns the server fd as output. There's much more flexibility in what value we put inside this attribute if the server sends it to the client.
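
For concreteness, a minimal sketch of what that interface could look like (get_server_fd_from_instance_id and struct instance_conn are made-up names for illustration, not existing PBS code):

    #include <string.h>

    /* Hypothetical helper in a common client library: map the value of
     * server_instance_id (e.g. "hostname:port") to the fd of the already
     * open connection to that instance. */
    struct instance_conn {
        const char *instance_id;    /* e.g. "hostA:15001" */
        int fd;                     /* connection to that instance */
    };

    int
    get_server_fd_from_instance_id(const char *instance_id,
                                   const struct instance_conn *conns, int nconns)
    {
        int i;

        for (i = 0; i < nconns; i++) {
            if (strcmp(conns[i].instance_id, instance_id) == 0)
                return conns[i].fd;
        }
        return -1;    /* unknown instance */
    }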

Let’s do the standard thing first. If it later turns out to be a problem, then we can create a specialized solution for it.

Regardless of whether we pass the socket along, we need the name. When a job is run on a node that isn't owned by the same server, the pbs_runjob() extend field is set to the name of the server owning the mother superior. This means we need the name.

If we are already passing the owning server name, should we also pass the socket? If the name can be turned into the socket, maybe we should only pass one.

From what I understand, the name of the server that owns a node or a job is going to be set by the owning server, so that should make @agrawalravi90 happy.

Bhroam

Yes, we can create a function to get the server fd, which would need to go into a common library accessible to all clients. Instead, I was thinking that if it is done inside the IFL layer, we could avoid this function.

Well, even if it is the server fd, our proposal is to have a macro-like function to access it, even if it just returns the same value for the time being. That way we can easily change its definition in the future.

@bhroam, we can derive the server name and port from a socket fd; all we need to do is use the function get_conn_svr_instances(). As of today, only pbs_runjob needs the name and port of the destination server; all other APIs just need the socket fd.

As of now we don't have ATTR_server for node objects; it is only present for job objects. Moreover, we are not supposed to use this attribute to refer to a server instance, because ATTR_server effectively means ATTR_cluster.

How we can create nodes/vnodes today, even though the above-mentioned attribute is not present, is described below. @nithinj, please add to or modify the following if I am missing anything.

  1. Natural nodes can be created using qmgr:
    PBS_SERVER_INSTANCES=<server_instance details> qmgr -c "c n stblr3"

  2. vnodes can be created using a version 2 MOM configuration file. Since a MOM only talks to one server (the one it is configured with), that server creates the vnodes from the information sent by the MOM in the UPDATE2 message.

Thanks for all the feedback. As both approaches technically and functionally achieve the desired goal, in the interest of time, and since some of us are in favor of the approach where we store the server instance name as the value of server_instance_id, I would like to go ahead with that approach and proceed.

I will update the design document accordingly. We will learn more as we implement this, and if there is ever a valid reason to go back to the socket fd approach, we can always do that, as the expected changes are minimal.

All,

I have updated the design document to reflect the above changes. I have also changed the title of the design doc to “Introducing an attribute server_instance_id for sharded objects”. Please take a look.

@suresht thanks for updating the doc, just a couple of comments:

  1. Since there is now no change to batch_status, you can remove its definition (the struct with its members) from the document.
  2. Please mention the permissions for this new attribute. You've already mentioned that it is read-only; please also mention whether it will be admin-readable only or whether normal users will be able to read it as well.