RunJob API will accept destination server identifier in extend parameter

This is part of multi-server design pages. This design will enable the client to run a job on a node reporting to a different server than the one job resides in. Please provide your feedback.

Thanks Nithin. A few questions:

  • Can you please clarify in the doc which server the request will go to? The one in the extend field, or the one whose fd has been provided?
  • What will happen if the server name and port are invalid?
  • What will happen if this is done for a single server configuration?
  • What will happen if the user only provides a valid server name, not the port number?
  • How will a user know which server a node belongs to?
  • The request will reach the source server where the job resides, which then moves the job to the server specified in the extend field of the request.
  • If the server name or port is invalid, pbs_errno is set to PBSE_BADHOST.
  • There will be no change in behaviour for a single server, as the APIs can be invoked with the server identifier in the extend field or NULL.
  • If the extend field is filled with only the server name without the port number, the default server port number will be attempted.
  • The client can find which server a node belongs to either by checking which socket returned the node, or by checking the server attribute (a read-only attribute on the node) which gets set during node creation.
    I’ve updated the doc with these details.
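The "server name without port falls back to the default port" behaviour described above can be sketched as a small parser. This is an illustrative stand-in, not PBS code; the 15001 default is the usual pbs_server port, but the real default is site-configurable.

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define PBS_DEFAULT_PORT 15001  /* illustrative default; the real value is site-configurable */

/* Split a "host[:port]" server identifier, falling back to the
 * default port when the caller supplied only the server name. */
void parse_server_id(const char *id, char *host, size_t hostlen, int *port)
{
    const char *colon = strchr(id, ':');
    if (colon == NULL) {
        snprintf(host, hostlen, "%s", id);
        *port = PBS_DEFAULT_PORT;        /* no port given: try the default */
    } else {
        size_t n = (size_t)(colon - id);
        if (n >= hostlen)
            n = hostlen - 1;
        memcpy(host, id, n);
        host[n] = '\0';
        *port = atoi(colon + 1);
    }
}
```

So "server1:15999" resolves to host server1, port 15999, while a bare "server2" resolves to server2 with the default port.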

Thanks for clarifying Nithin.

" * If the server name or port invalid, pbs_errno is set to PBSE_BADHOST"
" * If the extend field is filled with only the server name without the port number, the default server port number will be attempted."

Can you please add these to the doc? Also, the doc is a bit hard to read, I’d suggest creating bullet points which explain the change in behavior for the runjob API.

Thanks for the suggestion. I’ve updated the doc.

Looks good now, thanks for making the changes

@nithinj thanks for sharing this doc
I disagree with this approach. We have an open API. Anyone can write a program which uses it to interface with PBS. It shouldn’t be up to the client of our API to know which server to send the job to within a local complex. To me this sounds like something PBS should handle on its own. Why can’t the server that owns the job know where to send it? There is little difference between what you are suggesting and just having the API user call pbs_movejob() and then pbs_runjob(). You are also opening an avenue for error: if the API user runs the job on a server that doesn’t own mother superior, it will fail. If the server handled this, it would be less error prone.


Because each server only knows about its own nodes. So, if a user sends it a runjob request for a node it doesn’t own, it doesn’t know where that node resides.

We will come across the same problem when running multi-node jobs which span across servers. So this might be a good time to discuss how we should handle such cases. There are a couple of options:

  1. Have each server maintain metadata about {node: server} for each node that it doesn’t own. This means that when one does qmgr create node, it’ll go to all servers so that they can update their metadata. Then, servers will be able to do things by themselves without clients telling them anything. The con of this approach is that since servers don’t share a DB, if a server goes down, it will lose all of its metadata. How will it build it back?
  2. Have the clients provide information about the destination server, if it’s different from the owner. This will possibly mean a change to the exec_vnode syntax to include the server host:port before the list of all vnodes being booked for that server. This will be more hassle for the client, but it might be more robust, as a server going down won’t be as big a problem.

What do you guys think?

For the con of approach 1, I guess servers can save the metadata to the DB so that they get it back when they come back up. We’ll need to reject all node creation requests until all servers are up, but other than that it might work.

I think option one is the right way to go. The server needs to keep a cache of node:server pairs. It can start out with an empty cache, and when it receives a request for a node it doesn’t know about, it queries the nodes from the other servers. This could be as simple as doing a pbs_statvnode(), but I suspect there is a more efficient way to do it. Once it has done this query once, it will likely not have to do it again until either another server starts or the server restarts.
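The query-on-miss cache described above could look something like this toy sketch. Everything here is illustrative: in a real server the miss path would query peer servers (e.g. via pbs_statvnode()), whereas peer_lookup() below is a hypothetical stand-in with hard-coded answers.

```c
#include <stdio.h>
#include <string.h>

#define MAX_NODES 64

/* Toy cache of node -> owning-server pairs, starting out empty. */
struct node_entry { char node[64]; char server[64]; };
static struct node_entry cache[MAX_NODES];
static int cache_len = 0;

/* Hypothetical stand-in for querying peer servers about a node;
 * returns NULL if no peer owns it. */
static const char *peer_lookup(const char *node)
{
    if (strcmp(node, "nodeA") == 0) return "server1:15001";
    if (strcmp(node, "nodeB") == 0) return "server2:15001";
    return NULL;
}

/* Resolve a node to its owning server, querying peers only on a miss. */
const char *resolve_node(const char *node)
{
    for (int i = 0; i < cache_len; i++)
        if (strcmp(cache[i].node, node) == 0)
            return cache[i].server;          /* cache hit: no query */

    const char *srv = peer_lookup(node);     /* cache miss: ask peers once */
    if (srv != NULL && cache_len < MAX_NODES) {
        snprintf(cache[cache_len].node, sizeof cache[0].node, "%s", node);
        snprintf(cache[cache_len].server, sizeof cache[0].server, "%s", srv);
        cache_len++;
    }
    return srv;
}
```

The point of the sketch is the control flow: after the first miss for a node, subsequent requests are answered from the cache until something invalidates it, which is exactly the staleness concern raised later in the thread.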

The exec_vnode should NOT contain host information. It was specifically created to only contain vnodes. We specifically separated vnodes from hosts to have a virtual interface layer. Also, modifying the exec_vnode is probably not something we can do so we can maintain backwards compatibility.

I am trying to take a view from the API’s point of view and not the server’s. This to me seems like a part of PBS’s internals. When you do a pbs_connect(), you get back one connection handle to all servers. When you do a pbs_statjob(), you get back all the jobs from all the servers in one call. Why should pbs_runjob() be any different? You didn’t need to know about the different server’s before, why start here?


Maintaining the cache may not be as simple as querying peer servers for nodes once. Moms can change reporting from one server to another, which will change all cached values for the natural node as well as the compute nodes reported by them. This could result in the job landing on the wrong server.
One option would be to let it fail and re-query the nodes when the server receives an unknown node error.

Another option would be to keep building the cache by broadcasting qmgr and mom updates to all servers.

But in the simpler design, the server does not have to know anything about the nodes from other clusters. The concern about the runjob API can be resolved by making a statnode query internally within runjob when the server name is not passed in a multi-server setup. PBS clients such as the scheduler, which know where the node resides, can pass the server name.

It would be more flexible if you allowed the extend attribute to be a list of keyword=value pairs. As currently proposed, extend would be dedicated to this one function, with no place for future extensions.

E.g., extend value of on_server=server1:port,other_extension=mumble
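A keyword=value extend string like the one suggested could be parsed with a small helper along these lines. This is only a sketch of the proposed convention: on_server and other_extension are the hypothetical keywords from the example above, not a released API.

```c
#include <string.h>
#include <stdio.h>

/* Pull the value for one keyword out of an extend string of the
 * form "key1=val1,key2=val2". Returns 1 and fills val on success,
 * 0 if the keyword is absent. */
int extend_get(const char *extend, const char *key, char *val, size_t vlen)
{
    size_t klen = strlen(key);
    const char *p = extend;

    while (p != NULL && *p != '\0') {
        const char *end = strchr(p, ',');               /* end of this pair */
        size_t seglen = end ? (size_t)(end - p) : strlen(p);

        /* match "key=" at the start of the segment */
        if (seglen > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t n = seglen - klen - 1;
            if (n >= vlen)
                n = vlen - 1;
            memcpy(val, p + klen + 1, n);
            val[n] = '\0';
            return 1;
        }
        p = end ? end + 1 : NULL;
    }
    return 0;
}
```

With the example string, extend_get(..., "on_server", ...) yields "server1:port" while unknown keywords simply return 0, leaving room for future extensions as suggested.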


Thanks for the feedback. I’ve updated the doc as per your suggestion.

I’ve updated the doc to mention that pbs_runjob will still function in a multi-server mode even if the extend parameter is not passed to point to the right server. IFL will do a node_stat and figure this out internally. But clients who know where the node resides can pass this info which would eliminate this extra query.
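The fallback described here reduces to a small decision: use the caller-supplied destination when present, otherwise stat the node and read its server attribute. In this sketch, stat_node_server() is a hypothetical stand-in for the internal node_stat that IFL would perform.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an internal node_stat that reads the
 * node's read-only "server" attribute; returns NULL for unknown nodes. */
static const char *stat_node_server(const char *node)
{
    if (strcmp(node, "node1") == 0) return "server2:15001";
    return NULL;
}

/* Decide where a runjob request should be routed. */
const char *runjob_destination(const char *extend, const char *node)
{
    if (extend != NULL && *extend != '\0')
        return extend;               /* caller already knows the server */
    return stat_node_server(node);   /* extra query only when needed */
}
```

This mirrors the trade-off in the post: clients that pass the server in extend skip the extra query, while clients that don’t still get correct routing at the cost of one internal stat.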

" * Upon receiving the request server will move the job asynchronously to the destination server and run the job as part of the same request."

Can you please explain what “asynchronously” means? Will the server fire and forget? Or will it wait for an ack, which the receiver will send without processing the whole move and run?

I like the changes except for one thing. If a client wants to run a job on a server where the job resides, it shouldn’t have to provide the extend field. The IFL call will need to be able to determine if it needs to stat the node or not. This can be done if the client uses a server-specific socket or the virtual socket. Clients who care about performance will use both server-specific sockets and the extend field. Clients who don’t will use the virtual socket and won’t use the extend field. I don’t see a case where a client will use one and not the other.


You are right, and this case is handled in the code. pbs_runjob will do a node_stat if only the virtual fd is provided.

The job-owning server will move the job and initiate the run. It is a queue_job + pbs_commit for short jobs. The run will be triggered as part of the commit. The caller will not wait for all of this to finish but will move on. The destination server will send a response in case of any error, or when pbs_commit succeeds but before the job is sent to the mom. The source server will send a response back to the client upon receiving this reply.
I will add this info to the doc.

I just thought of a couple more things.
First is an error condition. What happens if the client submits a runjob with a move to a server that doesn’t contain mother superior? Will the client get an error back? I doubt it with an async run, but what about a pure pbs_runjob()?

Second is pbs_runjob(). It currently waits until the server connects to the mom, and the mom even runs the execjob_begin hook, before replying. What will happen with a move? Will that information come back to the client, or will we return much earlier, such as when the job is moved?


For the error condition, pbs_errno will indicate unknown node. There is no change in behavior.

The reply can be sent right before the job is sent to the mom, so that it will be consistent with runjob behaviour today.