Scheduler Maintaining Dedicated Server Connections

All,

I have come up with the following design document - “Scheduler Maintaining Dedicated Server Connections”.

https://pbspro.atlassian.net/wiki/spaces/PD/pages/1518108673/Scheduler+Maintaining+Dedicated+Server+Connections

Requesting the community to go through the design and provide their thoughts and suggestions.

Thanks,
Suresh


Thanks for writing this, Suresh. PFB some comments:

" As Scheduler is not closing Server connection/s we have seen Server sending an Acknowledgment for an IFL call(This is mostly the last IFL called from Scheduler for the current cycle) + Next scheduling command issued in one shot due to which Scheduler reads both of these together assuming that they are indeed the reply for only the last sent IFL call"

I’m not sure that I understand this. Are you saying that we need to shift responsibilities because the server will send an ack to the scheduler for the scheduler’s pbs_sched_cycle_end() call? If yes, then why don’t we just not send the ack? With TCP, we shouldn’t really need an ack back from the server. What do you think?

“For example, get_sched_cmd() at times reads more data than just a scheduling command.”
That just sounds wrong. Can we not fix this instead of bypassing the DIS layer altogether?

Thanks @agrawalravi90 for reviewing the design document. Please find my answers inline, tagged with [Suresh].

[Ravi] - I’m not sure that I understand this. Are you saying that we need to shift responsibilities because the server will send an ack to the scheduler for the scheduler’s pbs_sched_cycle_end() call? If yes, then why don’t we just not send the ack? With TCP, we shouldn’t really need an ack back from the server. What do you think?
[Suresh] Not sending an ACK for pbs_sched_cycle_end from the Server is a good idea. In fact, we tried this approach, and it works fine as long as we don’t hit the following issue.
Issue: When we don’t send an ACK from the Server for pbs_sched_cycle_end, the Server-side DIS routines can end up reading two IFL requests from the Scheduler in ONE SHOT, i.e. our initial problem reappears, this time on the Server side instead of the Scheduler side.

I am giving just one example below, but there are other similar cases which we have seen during the POC.

For example, assume that we have the following situation:

pbs_ifl1(), for which no ACK is sent from the Server
schedule() ===> let’s say the first IFL called as part of this function is pbs_statrsc()
pbs_ifl2(), for which no ACK is sent from the Server

pbs_ifl1() or pbs_ifl2() could be pbs_sched_cycle_end, for example.

So in this case the Server gets two IFL requests (the request for pbs_ifl1() + pbs_statrsc()) in one shot, which is where the initial problem reappears.
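
To make the “one shot” behaviour concrete, here is a minimal standalone sketch (plain sockets, nothing PBS-specific, and the request strings are purely illustrative) showing how a stream connection can hand two back-to-back requests to the reader in a single read:

```c
/*
 * A stream socket has no message boundaries: if the peer writes two
 * requests before we read, a single read() can return both of them.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(void)
{
    int fds[2];
    char buf[256];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return 1;

    /* "Scheduler" side: two requests written back to back */
    const char *req1 = "REQ1:pbs_ifl1\n";
    const char *req2 = "REQ2:pbs_statrsc\n";
    write(fds[0], req1, strlen(req1));
    write(fds[0], req2, strlen(req2));

    /* "Server" side: one read() pulls in both requests together */
    ssize_t n = read(fds[1], buf, sizeof(buf) - 1);
    buf[n] = '\0';
    printf("one read returned %zd bytes:\n%s", n, buf);
    return 0;
}
```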

Due to the issue mentioned above, this approach does not completely solve the problem. These kinds of issues happen because scheduling commands and IFL requests/responses are both sent on the primary connection. So one solution is to bifurcate the traffic, sending scheduling commands on one connection and the IFL to-and-fro communication on the other, which is how we arrived at the proposed solution (see the sketch after the list below). Segregating the responsibilities of the primary and secondary connections also has the following advantages:

  1. No chance for race conditions like the ones we have discussed
  2. More scalable
  3. No need to change any existing interfaces or functionality of the Scheduler.
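
As a rough illustration of this split (the function names below are hypothetical placeholders, not the actual scheduler code), the main loop would block only on the dedicated command connection, while all IFL traffic stays on a separate socket:

```c
#include <poll.h>

/* Hypothetical helpers standing in for existing scheduler logic. */
int  read_sched_cmd(int fd);   /* read exactly one scheduling command  */
void run_cycle(int ifl_fd);    /* run a cycle; all IFL calls use ifl_fd */

/*
 * cmd_fd : connection used only for receiving scheduling commands
 * ifl_fd : connection used only for IFL request/reply traffic
 */
void sched_main_loop(int cmd_fd, int ifl_fd)
{
    struct pollfd pfd = { .fd = cmd_fd, .events = POLLIN };

    for (;;) {
        /* Block only on the command connection; an IFL reply can never be
         * mistaken for a command here, because IFL traffic never touches
         * this socket. */
        if (poll(&pfd, 1, -1) <= 0)
            continue;
        if (read_sched_cmd(cmd_fd) > 0)
            run_cycle(ifl_fd);
    }
}
```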

“For example, get_sched_cmd() at times reads more data than just a scheduling command.”
[Ravi] - That just sounds wrong. Can we not fix this instead of bypassing the DIS layer altogether?
[Suresh] - This is a very intermittent issue, and in fact we did try fixing it by reading any leftover data from the DIS buffers, if present, before reading from the socket again. While attempting this we learned that we would have to change several common DIS routines that are used widely across the PBS code, so the impact (development and testing) is huge. Hence we were thinking of a simpler approach that solves the issue at hand, is more scalable, changes no external interfaces, and at the same time has little or no impact on other parts of the PBS code, which is how we came up with the proposed design.
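
For reference, the DIS-side fix we attempted would look roughly like the sketch below; every helper name here is an assumed placeholder rather than an actual PBS routine:

```c
/* Illustrative placeholders, not actual PBS APIs. */
int  dis_buffered_bytes(int sock);        /* unread bytes left in DIS buffer */
int  parse_message_from_buffer(int sock); /* parse next message from buffer  */
void wait_for_readable(int sock);         /* block in select()/poll()        */
int  read_and_parse_message(int sock);    /* read from socket, then parse    */

/*
 * The essence of the fix: never block on the socket while the userspace
 * read buffer still holds data from an earlier, larger read.
 */
int get_next_message(int sock)
{
    if (dis_buffered_bytes(sock) > 0)
        return parse_message_from_buffer(sock);  /* no select() needed */

    wait_for_readable(sock);
    return read_and_parse_message(sock);
}
```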

I’m not sure I understand. The server must get bombarded by IFL requests all the time from the same connection; does this problem not happen today? DIS itself must be capable of parsing one message at a time from the stream buffer, right?

But this sounds like a bug. disrsl() is supposed to read only 1 number from the stream. If it’s reading more than that, then we should fix the code anyway, right?

If the Scheduler does not close the Server connection after every scheduling cycle, we can certainly see this issue today as well.

Yes, this is what I was expecting initially. But the issue at hand is very intermittent in nature, and at times the DIS routines read more than what is expected. I don’t really know whether this is by design. If it is, then we would have to change a lot of DIS routines, as I explained earlier, and the impact is huge.

Yes, it looks like a bug, but again we don’t know whether it is by design; this happens very intermittently. The reason I say this is that DIS appears at times to read all the available data and keep it in its buffers, but our application logic does not know that some data is left over in the DIS buffers and therefore never reads it. If we really want to go in the direction of fixing all these DIS routines, we should also understand the original intent behind this design, and as I said earlier, going down that path has a huge impact. Looking at it from another angle, we strongly feel that these kinds of issues happen because scheduling commands and IFL communication both travel on the primary connection, and hence we came up with the current design, which leaves no scope for these intermittent issues and is also more scalable.

I don’t think it is by design. The function is supposed to parse and return only 1 integer/long value back, and it’s clear from the code that it is only supposed to parse one number from the stream buffer.

I’d like to hear what @subhasisb thinks about this, but personally, I don’t think it’s good to base the design around a bug in the code: if we fix this bug and DIS does the right thing, then we don’t need to shift the responsibilities of the primary and secondary connections.

About the new IFL call, I thought more about what @hirenvadalia had suggested during this morning’s call, and I feel like he might have a good point that this doesn’t need to be an IFL call. This function will only be used by the scheduler, and we aren’t really passing much data with this call, so why not just create an IS_SCHED_CYCLE_END message instead and pass it directly over the persistent connection that sched and server will have?

The first thing we tried when we encountered this bug was to fix the DIS issues that we are talking about. We in fact did a small POC around this and learned that fixing all the DIS routines and the callers that use them has a huge impact. Along with this, what triggered us to move to the current approach is the following:

  • As of today, the Scheduler already has primary and secondary connections. The secondary connection is only responsible for sending a single super-high-priority command to restart the scheduling cycle; in other words, it is not used for any other purpose. This led us to ask why we can’t shift the responsibilities of these connections and get the desired benefit.

  • Separating the sending of scheduling commands from the IFL communication has another advantage and is a fit for the following upcoming requirement which we have to meet:
    “Scheduler should decide the priorities of scheduling commands instead of the Server”.
    Going forward, multiple servers will talk to the scheduler in parallel, and the scheduler should be able to decide the priorities of all commands coming from all the servers correctly and act accordingly. So if we dedicate a connection just to read, collate and prioritise all these commands, we can very well have a thread accomplish this and thereby also achieve better scalability (a rough sketch follows below).
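
A rough sketch of such a collating reader is below (hypothetical helper names, not part of the proposal itself): it polls the command connections of all servers and hands everything it finds to the scheduler, which can then order the commands by whatever priority scheme it chooses:

```c
#include <poll.h>

#define MAX_SERVERS 64

/* Assumed helpers, for illustration only. */
int  read_sched_cmd(int fd);              /* read one command per call    */
void queue_cmd(int server_idx, int cmd);  /* insert into a priority queue */

void collate_commands(int cmd_fds[], int nservers)
{
    struct pollfd pfds[MAX_SERVERS];

    if (nservers > MAX_SERVERS)
        nservers = MAX_SERVERS;
    for (int i = 0; i < nservers; i++) {
        pfds[i].fd = cmd_fds[i];
        pfds[i].events = POLLIN;
    }

    for (;;) {
        if (poll(pfds, nservers, -1) <= 0)
            continue;

        /* Drain every connection that has a command ready, so that the
         * scheduler, not the servers, decides what to handle first. */
        for (int i = 0; i < nservers; i++) {
            if (pfds[i].revents & POLLIN) {
                int cmd = read_sched_cmd(pfds[i].fd);
                if (cmd > 0)
                    queue_cmd(i, cmd);
            }
        }
    }
}
```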

In fact, I am also requesting @subhasisb to share his thoughts on this, as he is aware of the complete background.

As I explained in the call today, sending commands to the Scheduler is unidirectional (only the Server sends commands to the Scheduler); the Scheduler does not send any commands to the Server, rather it uses IFL to talk to the Server. The Server treats the Scheduler as just another client, and that’s why IFL is used. So unless we change the design, i.e. make the Scheduler also send commands to the Server, we cannot go with creating a command like IS_SCHED_CYCLE_END.

@agrawalravi90 AFAIK, the DIS routines parse and return only the requested characters from their read buffer.
@suresht and @agrawalravi90 If I understand the problem correctly, then the issue is in PBSD_rdrpy_sock (here): once the IFL API reads its reply (assuming the IFL API needs a reply/ACK from the other party), it clears DIS’ read buffer, and because of that the next command on that connection will be lost…

Thanks @agrawalravi90 for mentioning this :slight_smile: But just one correction: I was talking about a new SCH_* command like SCH_SCHEDULE_END (as defined in sched_cmds.h) instead of an IS protocol command.

Yes @hirenvadalia, this is correct to my knowledge.

The issue is a little different, @hirenvadalia. At times the Scheduler gets the ACK + the next sched command together, but at that point the Scheduler is expecting only an ACK from the Server for the previous cycle, not the next command as well, and hence it reads the ACK and then stops.
The leftover data (after correctly reading the ACK) is actually present in the DIS buffers, but the application code has no idea about this, and in this case the Scheduler does a select() to get the next sched command, which never arrives because the data already came in with the earlier read. So here select() blocks indefinitely. We can discuss this further.
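
A standalone demonstration of this blocking (plain sockets, not PBS code; the message strings are only illustrative) is below: one large read() pulls both the ACK and the next command into a userspace buffer, after which select() never reports the fd readable again:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>

int main(void)
{
    int fds[2];
    char buf[256];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return 1;

    /* "Server" writes the ACK and the next command back to back. */
    const char *ack = "ACK\n";
    const char *cmd = "SCH_SCHEDULE_NEW\n";
    write(fds[0], ack, strlen(ack));
    write(fds[0], cmd, strlen(cmd));

    /* "Scheduler" reads once: both messages land in its userspace buffer,
     * but the application only acts on the ACK. */
    ssize_t n = read(fds[1], buf, sizeof(buf) - 1);
    buf[n] = '\0';
    printf("read %zd bytes (ACK plus the next command):\n%s", n, buf);

    /* It then waits for the "next" command via select(): no new data will
     * ever arrive on the socket, so this times out (or, with no timeout,
     * blocks forever) even though the command is already sitting in buf. */
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fds[1], &rfds);
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    int ready = select(fds[1] + 1, &rfds, NULL, NULL, &tv);
    printf("select() returned %d: the command is stuck in the buffer\n", ready);
    return 0;
}
```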

All,

I have added the additional information that I have mentioned here to the design document also. Please take a look.

This would be a perfect opportunity to ask this question: “Why do we need to demarcate the start and end of a scheduling cycle?”

Another way to think of this might be a need for a global flag that just says “something changed in the world”, so that when the scheduler has time it can attend to it.

Perhaps we can find some good ideas to address this complex requirement of each server telling the scheduler to start and the scheduler needing to tell each of the servers that scheduling has ended.

“command” is a strong word; I look at this as just a response from the sched to the server’s original request to run a sched cycle. This would be similar to the messages that the MoM sends to the server. The server will just need to add the scheduler’s connection to its tpp poll list; whenever sched replies back to it, it’ll get notified and know that the sched cycle is over. You are proposing that the connection be persistent, so why not take advantage of that? Would it be a major redesign?

@suresht thanks for explaining things in person. From what you said, it sounds like the sched will keep persistent connections on both the batch_requests side and the command side. In that case, if choosing between a batch_request vs a direct connection write just comes down to the cost of DIS encoding, then I’m ok with either, as long as we don’t call this an IFL call.

@subhasisb Interesting idea … I’ll throw out another thought for a different route in the future: what if we removed the start/end sched cycle mechanism altogether and just let the scheduler decide when it needs to run a sched cycle? sched needs to stat the universe anyway; what if it just stat’d the diff on a regular basis and decided from the delta whether it needs to run a sched cycle?

@bhroam requesting inputs from you as well, thanks.

Thanks for the poke @agrawalravi90. I’ve been meaning to reply here, but have been busy.

Thanks for posting this @suresht and getting this conversation started.

My thoughts:
Your design talks about two things that are at odds with each other.

The first is a new IFL call to tell the server about the end of the cycle. The reason the server knows this now is so it can stop sending the scheduler commands until the cycle is over.

The second part of the design talks about moving the decision about the priorities of the scheduler commands from the server to the scheduler.

Right now, during a cycle, the server keeps all the qrun requests, one high-priority command, and one normal-priority command. If many different commands come in, we only send the one kept command (I think it is the last one set). If we continue doing this, we haven’t really changed who decides the priority of the scheduler commands. The only thing the scheduler now gets to decide is the priority of the scheduler commands between the different servers.
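
For clarity, here is a rough sketch of that keep-one-command behaviour as I understand it from the description above; the names are hypothetical and this is not the actual server code:

```c
#define MAX_QRUNS 1024

/* Assumed helpers, for illustration only. */
int is_qrun(int cmd);
int is_high_priority(int cmd);

struct pending_cmds {
    int qrun_cmds[MAX_QRUNS];   /* every qrun request is kept            */
    int nqruns;
    int high_prio_cmd;          /* at most one is kept; the latest wins  */
    int normal_prio_cmd;        /* at most one is kept; the latest wins  */
};

void remember_cmd(struct pending_cmds *p, int cmd)
{
    if (is_qrun(cmd) && p->nqruns < MAX_QRUNS)
        p->qrun_cmds[p->nqruns++] = cmd;
    else if (is_high_priority(cmd))
        p->high_prio_cmd = cmd;     /* overwrites any earlier command */
    else
        p->normal_prio_cmd = cmd;   /* overwrites any earlier command */
}
```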

If we truly want to let the scheduler decide about the priority of its commands, then we need to send them all to the scheduler and allow it to do just that.

Right now, most of the different commands do the same thing, which is to run a cycle. It doesn’t always have to be this way. The sched team has ideas about the future vision of the scheduler. Once all the PBS data is in the database, there isn’t any reason the scheduler can’t query it from there. It can do the same thing that the other servers are doing and ask, “what changed between now and the previous time I asked?” That allows us to become persistent. We won’t have to query the entire universe every time. In a cycle with 1M jobs and 50k nodes, it takes the scheduler 30s to query the entire universe. If only 10 new jobs have come in, it will be much faster to only query those 10. If we know that we got a SCH_SCHEDULE_NEW, then we know we only have to schedule new jobs (and more importantly, we don’t rebuild the calendar if we don’t need to). We won’t have to see if all the jobs can run, only the ones which are new. If something ended, we can run a full cycle. If a reservation has started, we only need to run jobs within that reservation.

With all that said, could a single ‘something has changed’ bit take the place of all the different commands? If we can become persistent as I just described, then probably. We ask “what changed?”, get back 10 new jobs, and can infer that we only need to run those.
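
A sketch of what that delta-driven loop could look like, with hypothetical types and helpers standing in for the future database queries:

```c
/* Assumed helpers and types, for illustration only. */
struct delta {
    int new_jobs;        /* jobs submitted since the last query        */
    int jobs_ended;      /* jobs finished / resources freed since then */
    int resv_started;    /* reservations that have just started        */
};

struct delta query_changes_since(long last_query_ts);
void run_full_cycle(void);
void run_cycle_for_reservations(void);
void run_cycle_for_new_jobs(void);

void maybe_run_cycle(long *last_query_ts, long now)
{
    struct delta d = query_changes_since(*last_query_ts);
    *last_query_ts = now;

    if (d.jobs_ended)
        run_full_cycle();              /* something freed up: look at everything */
    else if (d.resv_started)
        run_cycle_for_reservations();  /* only jobs inside those reservations    */
    else if (d.new_jobs)
        run_cycle_for_new_jobs();      /* no need to rebuild the calendar        */
    /* else: nothing changed, no cycle needed */
}
```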

@suresht I agree with @agrawalravi90 that it sounds like DIS is broken and should be fixed. It sounds like what you are doing is a workaround. I personally would rather see DIS be fixed even if the effort is higher because it will make PBS more maintainable in the future.

Bhroam

I have made some minor changes to the document while we go through the other thoughts mentioned here.