Adding a new member to batch_status structure

All,

A design proposal for adding a new member to the batch_status structure has been created. Its main purpose is to fetch the socket descriptor of the server that owns a sharded object such as a job or node. This new member is useful only in the multi-server case. Requesting community feedback on the proposal.
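
For reference, the public batch_status structure in pbs_ifl.h currently looks roughly like the sketch below; the extra member is only an illustration of what the proposal describes (the name "sock" and its exact type come from the design draft, not existing code):

    /* Current public structure from pbs_ifl.h (simplified) */
    struct batch_status {
            struct batch_status *next;    /* next object in the returned list          */
            char                *name;    /* object id, e.g. job id or vnode name      */
            struct attrl        *attribs; /* linked list of attribute name/value pairs */
            char                *text;    /* any additional text from the server       */

            /* proposed addition (illustrative only): descriptor of the
             * server instance that owns this sharded object */
            int                  sock;
    };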

In the proposal you call this new member “sock”, but isn’t it really a “virtual stream connection” as returned by pbs_connect()? I think the name “sock” could be confusing.

Thanks for writing the design @suresht

I have a doubt about your design.
You are introducing two different types of socket descriptors, and the user of the IFL API needs to know which type to pass. Before, we called pbs_connect(), got a socket descriptor, and used it everywhere. Now, for the stat calls you use the descriptor returned by pbs_connect(); for non-sharded objects, I assume you still use that descriptor; but for sharded objects, you need to use the socket descriptor of the server that owns the object. This puts a large burden on the user of the API to know the internals of PBS.

You’ve taken a very easy-to-use API, pbs_connect(), IFL, IFL, IFL, pbs_disconnect(), and turned it into one where the user needs to know which objects are sharded and which aren’t. If we ever want to shard another object, the user of the API would have to change how they use it once again.
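
To make the contrast concrete, here is a minimal sketch of the usage pattern described above, using only existing IFL calls; today every call takes the single descriptor returned by pbs_connect():

    #include <stdio.h>
    #include <pbs_ifl.h>
    #include <pbs_error.h>

    int main(void)
    {
        /* one connection for the whole session */
        int conn = pbs_connect(NULL);
        if (conn < 0) {
            fprintf(stderr, "pbs_connect failed, pbs_errno=%d\n", pbs_errno);
            return 1;
        }

        /* every IFL call takes the same descriptor, sharded object or not */
        struct batch_status *jobs  = pbs_statjob(conn, NULL, NULL, NULL);
        struct batch_status *nodes = pbs_statvnode(conn, NULL, NULL, NULL);

        /* ... consume the results ... */

        pbs_statfree(jobs);
        pbs_statfree(nodes);
        pbs_disconnect(conn);
        return 0;
    }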

This should be handled by PBS itself. Either have the servers forward the IFL request along to the right server, or have some sort of IFL daemon sitting in front of the IFL calls that knows where to forward each request.

Bhroam


I agree with @bhroam - we should not need this at all. The end-user should not be burdened with knowing which object resides where, or whether the object is sharded or not; that is all implementation detail inside the PBS server.

IFL API users should simply use the virtual socket in most cases. However, in the case of the scheduler, for example, we need the scheduler to decide which specific server to send a command to (IFL does not and cannot have this information), so we need a way for the scheduler to point to a specific server instance.

This means we need a way to get the fd for a specific server instance, so we can add a new IFL API that returns this fd for use by the end-user.

So, I do not see why you need a “sock” member inside the batch_status structure. For objects like jobs/resv/nodes that are sharded, we could simply add an attribute to the object called “server_instance_id” (or something similar) and populate it with an index number or a server:port name. That would be consistent with how batch_status works, an id plus a list of attributes, and we won’t need to add another member variable there.
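
As an illustration of that suggestion, a client could pull such an attribute out of the existing attribute list without any structure change; the attribute name server_instance_id below is the name proposed here, not something PBS defines today:

    #include <stdlib.h>
    #include <string.h>
    #include <pbs_ifl.h>

    /* hypothetical attribute name from the suggestion above */
    #define ATTR_SERVER_INSTANCE "server_instance_id"

    /* Return the owning server index for one stat'ed object, or -1 if the
     * object is not sharded / the attribute was not returned. */
    static int owning_server_index(struct batch_status *bs)
    {
        struct attrl *pat;

        for (pat = bs->attribs; pat != NULL; pat = pat->next) {
            if (strcmp(pat->name, ATTR_SERVER_INSTANCE) == 0)
                return atoi(pat->value);
        }
        return -1;
    }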

How about the following?

  1. For regular clients (and in general) we always use the virtual socket fd - IFL internally figures out whom to talk to, either all the servers in some random fashion, or via some cache (yet unimplemented), or by knowing something from the jobid.

  2. Some clients (like the scheduler) might want to talk to the owning server directly for performance reasons. For example, it would be wasteful to talk to all the server instances to do a pbs_sigjob(jobid) when the scheduler already knows where the object resides based on data returned from the status calls earlier.
  a) Granted, we can build a cache inside IFL that keeps track of where an object resides, but that is duplicate information on top of what the persistent scheduler will keep, and it will also need to be updated as objects move between server instances. Also, for each IFL call, we will have to do an avltree lookup. If we are doing thousands of API calls per second, that is thousands of avltree lookups per second, when we can already have this information in the object structure. But yes, that is one way to transparently handle this.

    b) We can add the “owner_info” information as an attribute of the object when the object is sharded. If the client wants fast access, it can use this information to get the specific server fd from IFL. If this attribute is not populated (or NULL), IFL will return the virtual fd anyway. This can be a simple macro that just returns the integer-typecast value of “owner_info” (basically a socket fd internally), so it would be very fast. Users will not need to know any of this; they would simply call this API for every object, for every API call that deals with an object (see the sketch after this list).
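
A rough sketch of what option (b) could look like; pbs_owner_fd and owner_info are illustrative names, and the detail of how the owner information is encoded is an assumption:

    #include <stdlib.h>

    /* Illustrative only: return the per-instance descriptor when owner
     * information is available, otherwise fall back to the virtual fd
     * obtained from pbs_connect(). */
    static inline int pbs_owner_fd(const char *owner_info, int virtual_fd)
    {
        return (owner_info != NULL) ? atoi(owner_info) : virtual_fd;
    }

    /* Callers then do the same thing for every object, sharded or not, e.g.:
     *     rc = pbs_sigjob(pbs_owner_fd(owner_info, conn), jobid, "SIGTERM", NULL);
     */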

Thoughts?


Thank you @subhasisb. I prefer b).

Thank you all for the feedback. I have modified the design document to reflect the new approach. Requesting the community to take a look and provide further comments or approval.

Hi Suresh, some comments:

  • Can you please list all the IFL calls which will add the new attribute to the batch_status?
  • Can you please explain what will happen if the client specifies a fixed set of attributes to be queried from the server? Will IFL still add this attribute to the list?
  • I think you should change the title of this thread to something generic, like “IFL will add server information in batch_status”, since it’s no longer a field.

I have modified the design document to contain the list of IFLs that return this attribute when queried. Also adding it here for reference.

  • pbs_statvnode
  • pbs_statnode
  • pbs_stathost
  • pbs_statjob
  • pbs_selstat

The IFLs listed above add this attribute only when either the full set of attributes is queried or this attribute is queried specifically. In all other cases the attribute is not returned.
Case-1: When all stats are queried. For example
[root@stblr3 pbspro]# pbsnodes -av
stblr3
     Mom = stblr3
     Port = 15002
     pbs_version = 20.0.0
     ntype = PBS
     state = free
     pcpus = 4
     resources_available.arch = linux
     resources_available.host = stblr3
     resources_available.mem = 7897412kb
     resources_available.ncpus = 4
     resources_available.vnode = stblr3
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     license = l
     last_state_change_time = Fri Nov 13 17:05:04 2020
     owning_server_info = 3

Case-2: When a set of attributes are queried
[root@stblr3 pbspro]# qmgr -c "l n stblr3 Mom,Port"
Node stblr3
    Port = 15002
    Mom = stblr3

Case-3: When this attribute is queried specifically
[root@stblr3 pbspro]# qmgr -c "l n stblr3 Mom,Port, owning_server_info"
Node stblr3
    owning_server_info = 3
    Port = 15002
    Mom = stblr3

This keeps it consistent with the way other attributes are queried.

I have changed the title of the design document to “Introducing an attribute owning_server_info for sharded objects”, but I could not find an edit option to change the title of this post in the forum. Please let me know if there is any way to do it.

Added some more information and examples to the design document based on the feedback received. Requesting further feedback from the community.

Hey,
A few comments:
First, I don’t think you need to hide a socket ID in the batch_status attributes. There will be a small number of servers; if there is a way to query the sockets and the servers that go with them, the scheduler can do the association itself. Just add a client-side query that returns an array of server-name-to-socket pairs. The objects already have the server name as an attribute, so the scheduler can just match it.
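
As a hedged sketch of that idea: assume a hypothetical client-side query fills a table of server-name/socket pairs (no such call exists in IFL today); the scheduler then matches each object’s existing server attribute against the table:

    #include <string.h>
    #include <pbs_ifl.h>

    /* hypothetical result of a "list my server connections" query */
    struct svr_conn {
        char *name;   /* server instance name, e.g. "stblr3:15001" */
        int   sock;   /* descriptor connected to that instance     */
    };

    /* Match one stat'ed object against the table using the server name
     * that objects such as jobs already report in their "server" attribute. */
    static int sock_for_object(struct batch_status *bs,
                               struct svr_conn *svrs, int nsvrs)
    {
        struct attrl *pat;
        int i;

        for (pat = bs->attribs; pat != NULL; pat = pat->next) {
            if (strcmp(pat->name, "server") != 0)
                continue;
            for (i = 0; i < nsvrs; i++)
                if (strcmp(svrs[i].name, pat->value) == 0)
                    return svrs[i].sock;
        }
        return -1;    /* no match: fall back to the virtual fd */
    }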

Second, the scheduler doesn’t care about pbs_sigjob(); I’d swap all references from sigjob to alterjob in your document.

Third, this is probably the biggest problem: pbs_preemption() is given a list of jobs and that list is shipped to the server for it to preempt them. Now that list of jobs will be spread across different servers. I think you’re going to have to have pbs_preemption() figure out which server to send them to, and then send off N pbs_preemption() calls. It will then accumulate the return values.
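
A sketch of that fan-out, under the assumption that the job list has already been grouped by owning server; group_jobs_by_owner() is implied and send_preempt_request() is a placeholder, not an existing IFL call:

    /* one group of job ids per owning server instance */
    struct preempt_group {
        int    sock;     /* descriptor of the owning server instance     */
        char **jobids;   /* NULL-terminated job id list for that server  */
    };

    /* placeholder for the per-server request; not a real IFL function */
    int send_preempt_request(int sock, char **jobids);

    /* Fan the preemption request out to each owning server and accumulate
     * the results; a real implementation would also merge the per-job
     * status that each server returns. */
    static int preempt_all(struct preempt_group *groups, int ngroups)
    {
        int i, rc = 0;

        for (i = 0; i < ngroups; i++)
            if (send_preempt_request(groups[i].sock, groups[i].jobids) != 0)
                rc = -1;
        return rc;
    }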

Bhroam

Thanks @bhroam for the feedback. Please find my responses given below.

The idea is not to change the definition of the batch_status structure; it is heavily used by all IFLs to return the response to a batch request. There were concerns about adding a direct member to this structure, and hence we chose this approach. Also, owning_server_info needs to be returned only by stat-related IFLs like pbs_statjob, pbs_selstat, and pbs_statvnode.

On the scheduler doing the job→server or node→server association: we actually tried a similar approach during our POC. Following are our learnings.

  • Let’s say we have a query that returns an array whose elements are key-value pairs of server name to socket. The scheduler then has to compare every job/node with the contents of this array to find the right server and store it in its internal job_info/node_info structures. The lookup per object is O(n), where n is the number of servers. If the job volume is very high, say 100,000 jobs in a cycle, the sched_cycle length is extended by O(n*k) time, where k is the number of jobs (100,000 in this case).

  • The other approach is to build an AVL tree from the result of that query (the array of server-name-to-socket pairs). Constructing and maintaining an AVL tree might also turn out to be costly, especially if the number of servers is large, plus we have to deal with string comparisons since the key is a string.

If we follow the current approach, we can avoid both of the above and still achieve the desired goal. In this approach:
Time complexity of adding the owning_server_info attr = O(1), since it is added as the first attribute.
Time complexity for the scheduler to populate job_info/node_info with this attribute’s value = O(k) instead of O(n*k), where n is the number of servers and k is the number of jobs (see the sketch below).
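
The O(k) path can be illustrated with a short helper that reads only the first attribute of each stat'ed object; the attribute name owning_server_info is the one from the design, and placing it first in the list is the assumption stated above:

    #include <stdlib.h>
    #include <string.h>
    #include <pbs_ifl.h>

    /* Read the owning server index in O(1) per object: owning_server_info
     * is returned as the first attribute, so no list scan or per-server
     * table lookup is needed. */
    static int owning_server_of(struct batch_status *bs)
    {
        struct attrl *first = bs->attribs;

        if (first != NULL && strcmp(first->name, "owning_server_info") == 0)
            return atoi(first->value);   /* e.g. "3" in the pbsnodes output above */
        return -1;                       /* attribute not returned */
    }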

I have swapped pbs_sigjob() references with pbs_asyalterjob(). Please check the design document.

@agrawalravi90 is coming up with an approach to handle job preemption and will add more details shortly.

We are thinking of handling this on the server side itself: one server receives all the job ids, picks out its own, broadcasts the rest, and goes about its business while the others pick out their jobs from the list, preempt them, and reply to that server. Once that server has received responses from all of them, it replies back to the client. In this scenario the IFL doesn’t change much, clients don’t do anything special, and the servers do most of the work.

After discussing this more with the team, it might be easier to just broadcast the preemption request to all servers, which will pick out their own jobs and preempt them; IFL will then collate the replies and send them back to the scheduler. Let me know what you guys think.

Listing the pros and cons of each approach we were discussing:

Approach 1: Adding a new attribute

  • Pros: not modifying batch_status

  • Cons: Clients get an additional field that consumers of the client output may not care about. For instance, it will be displayed by qstat -f etc. unless we suppress it.

  • Cons: Clients will not receive this attribute with a partial status, i.e. when they ask for a specific list of attributes.

Approach 2: Adding a PBS_Server attribute in the server itself for sharded objects.

  • Pros: It will be useful even for consumers of the client output to know which server owns the object.
  • Cons: The scheduler might have to loop through all jobs and nodes doing string comparisons.

Approach 3: Adding a member to batch_status.

  • Pros: Efficient. Other clients/IFL code that needs to know the owner from batch_status can do so without looping through all the attributes.

We can take approach #2 if the performance penalty is not bad.

We can also avoid the loop in approach 1 if we add this attribute as the first one in the attribute list.