The RunJob API will accept a destination server identifier in the extend parameter

The reply is not sent before the job is sent to the mom; it is sent after the mom replies back. This allows the job to be rejected by the execjob_begin hook.

How are you going to handle this, given that the server sending the job to the mom is not the one that received the runjob batch request?

Bhroam

The source server, which received the runjob request, will forward the request to the destination server and proceed with other tasks. The destination server will reply either before or after sending the job to the mom, depending on the nature of the request. Upon receiving that response, the source server will send the final reply to the client.
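A minimal, self-contained sketch of the flow described above. All names here (`Server`, `run_job`, `accept_job`) are illustrative stand-ins, not actual OpenPBS internals:

```python
# Hypothetical simulation of the forwarding flow: the source server forwards
# the job, and replies to the client only after the destination responds.

class Server:
    def __init__(self, name):
        self.name = name
        self.jobs = {}  # jobid -> state

    def run_job(self, jobid, dest, client_replies):
        """Source side: forward the job; the final client reply is queued
        only once the destination (and its mom) have accepted or rejected."""
        accepted = dest.accept_job(jobid)
        if accepted:
            self.jobs.pop(jobid, None)  # dequeue from source on success
        client_replies.append((jobid, accepted))

    def accept_job(self, jobid):
        """Destination side: send to its mom; an execjob_begin hook could
        reject here. This sketch always accepts."""
        self.jobs[jobid] = "R"
        return True

src, dst = Server("s1"), Server("s2")
src.jobs["1.s1"] = "Q"
replies = []
src.run_job("1.s1", dst, replies)
```

The key point the sketch captures is ordering: the client reply is appended only after the destination's answer comes back, so hook rejection on the destination can still be surfaced to the original client.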

@nithinj what happens if the destination server rejects the move+run request? Does the job remain on the source server? Can there be a race condition where the scheduler stats the servers and both servers report the same job?

Or one where a different client stats the servers and sees two copies of a job?

The source server will mark the job in the 'T' (transit) state at the beginning of the operation. The job will be dequeued when the source gets a successful response from the destination. So there can be a window where both the source and destination servers report the same job, but the source server will be reporting it in the 'T' state.
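The transit handling described above can be sketched as follows. The state letters mirror PBS job states ('Q' queued, 'T' transit, 'R' running), but the function itself is hypothetical:

```python
# Illustrative sketch of move+run state transitions on the source server.

def move_and_run(src_jobs, dst_jobs, jobid, dest_accepts):
    """Mark the job in transit, then either hand it off or roll back."""
    src_jobs[jobid] = "T"        # mark transit before sending
    if dest_accepts:
        dst_jobs[jobid] = "R"    # destination now owns the job
        del src_jobs[jobid]      # dequeue from source on success
    else:
        src_jobs[jobid] = "Q"    # rejected: job stays queued on source
    # Between marking 'T' and dequeueing, both servers can report the job,
    # but the source copy is visibly in state 'T'.

# Successful hand-off
src_a, dst_a = {"1.s1": "Q"}, {}
move_and_run(src_a, dst_a, "1.s1", True)

# Destination rejects: job remains on the source
src_b, dst_b = {"2.s1": "Q"}, {}
move_and_run(src_b, dst_b, "2.s1", False)
```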

The scheduler will not see this, as it makes sure that all servers are in a consistent state before querying the universe.

Ok, thanks for clarifying. I think we might want to avoid this by delaying the response to the clients until the move and run is complete. It's a rare event, so it shouldn't affect the clients that much. Many front-end tools might flip out if they see two copies of the same job, so I think it would be better to avoid it if possible.

Delaying responses will give inconsistent data to the client. When one server is ready to serve the request, the other servers may not be, so different server instances would reply at different points in time, leading to inconsistent data. We could still achieve this with a pbs_server_ready-like protocol, similar to the one used between the scheduler and server to check that all servers are ready to serve requests.

  • pbs_server_ready
    This API can be published for clients to use to get a consistent response. It will block the client until all servers are ready to respond.
    The issue with this approach is that the client gets a reply only after all the inter-server operations are finished. These requests would also pile up in the server, making it unresponsive, since it may also have to respond to scheduler requests once the inter-server operations finish.

  • Let IFL remove the duplicate if there is one
    The client might have to go through all job ids to find the one in the 'T' state. The server should indicate that a move+run is in progress using some bits, to avoid doing this all the time. I'm hoping the response would still be faster than waiting. But with this, we are inventing another way to handle the same issue.

  • Let the servers send the job ids, with the state marked appropriately as Transit
    Even if the client receives two identical job ids, it can distinguish between them based on the state.

I am inclined to continue with option #3 until we hear more on this. We can publish pbs_server_ready and integrate it with more clients if that is required. Let me know what you think.
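Option #3 from the client's perspective could look something like the sketch below: when two servers report the same job id, keep the copy that is not in transit. The helper name `dedup_statjob` is hypothetical, not an actual IFL call:

```python
# Hypothetical client-side deduplication of stat-job replies aggregated
# from multiple servers. A 'T' (transit) copy is dropped in favor of the
# authoritative copy from the other server.

def dedup_statjob(replies):
    """replies: list of (jobid, state) tuples from all servers."""
    best = {}
    for jobid, state in replies:
        # Prefer any non-transit copy over a 'T' copy of the same job.
        if jobid not in best or best[jobid] == "T":
            best[jobid] = state
    return sorted(best.items())

# Job 1.s1 appears twice during a move+run window: once in 'T' on the
# source and once in 'R' on the destination.
deduped = dedup_statjob([("1.s1", "T"), ("1.s1", "R"), ("2.s1", "Q")])
```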

With the PR "Multi-Server: MN jobs crossing local server boundaries by nithinj · Pull Request #2253 · openpbs/openpbs · GitHub", the server will keep a minimal cache of nodes from other servers. I am going to remove this design page, as the server knows where to send the job with the help of this cache and doesn't have to rely on the on_server field. Let me know if you have any concerns.

Don't delete it, just mark it "obsolete" or something similar. That way, if by any chance we need to go back to the old design or refer to it, it'll still be there.