As of today, if a server_dyn_res program/script hangs or does not return, the scheduler keeps waiting for it to complete.
The following design document proposes a solution to this hang issue: Design Document
Please review the proposed design and provide comments/feedback.
1) Currently the EDD is written to sound like there is only ever one server_dyn_res script (“THE server_dyn_res” is written twice, but I think it should be “A server_dyn_res”).
2) Related to 1), the EDD should be explicit about whether this timeout applies to ALL server_dyn_res scripts together, or to EACH server_dyn_res script individually.
3) The EDD should be explicit about how the resource associated with a timed-out script run is treated in the cycle in which it timed out. Does the cycle continue normally, with the value of that resource assumed to be 0?
4) Related to 3), assuming the timed-out resource value is treated as 0 for that cycle, the log message in interface 2 should be explicit about this. That is not necessary, though, if in this scenario the log will contain something like these current messages in addition to the new message in interface 2:
05/17/2017 09:54:07;0080;pbs_sched;Svr;server_dyn_res;Error piping to program /bin/get_foo.
05/17/2017 09:54:07;0100;pbs_sched;Svr;server_dyn_res;/bin/get_foo = 0
If those messages will be printed upon timeout in addition to the new one in interface 2, I think the EDD should be explicit about it.
5) I am not sure I like the name in interface 1. I like the simpler “server_dyn_res_timeout” better, or maybe “server_dyn_res_alarm”. Did you insert the “prog” to try to make it clearer that the timeout applies to each individual script/program? I think if the EDD and docs are explicit about this, then the simpler attribute name is better.
I think that it should apply to each server_dyn_res script individually, to match the behavior of the alarm in hooks.
This is a good question. Sites use server_dyn_res in various ways. Some sites use it to get license counts, others use it to alter jobs, etc. I think that we should assume zero and continue. If a site wants different behavior, it will need to add a timeout in its script.
I have modified the EDD with the following messages:
… …;0080;pbs_sched;Svr;server_dyn_res;program /bin/get_foo timed out
… …;0100;pbs_sched;Svr;server_dyn_res;/bin/get_foo = 0
I have not added the “Error piping to program” message, because a timeout may be a hang rather than a piping error. We have a specific condition for piping errors in the code, which will remain intact. It is quite possible that the program is piping correctly but is delayed, slow, or hung, and so it times out.
So I prefer keeping only the “timed out” message.
Please review the updated EDD.
Thanks,
Varun
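For reviewers, the per-script behavior discussed above (each program gets its own alarm, a timeout is logged, and the value is then taken as 0) could be sketched roughly as below. This is only an illustration in Python, not the scheduler's actual C code: `query_dyn_res` and the in-memory `log` list are hypothetical names, and the log text is shortened to the part after the timestamp and event fields.

```python
import subprocess
import sys


def query_dyn_res(program, alarm_secs, log):
    """Run one server_dyn_res program (argv list) and return its value.

    Hypothetical helper: on timeout the value is assumed to be "0" and the
    scheduling cycle would simply continue with that value.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it has not finished within alarm_secs.
        done = subprocess.run(
            program, capture_output=True, text=True, timeout=alarm_secs
        )
        value = done.stdout.strip()
    except subprocess.TimeoutExpired:
        log.append(f"server_dyn_res;program {program[0]} timed out")
        value = "0"
    log.append(f"server_dyn_res;{program[0]} = {value}")
    return value


if __name__ == "__main__":
    messages = []
    # Stand-in for a hung script: a child process that sleeps past the alarm.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    query_dyn_res(hung, 1, messages)
    for line in messages:
        print(line)
```

Because the alarm is passed per call, running several scripts in a loop gives each one its own timeout, which matches the "EACH server_dyn_res individually" reading.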
@varunsonkar unless the two log messages (interfaces 2/3) are required for automated testing, I would suggest making them Unstable. Log messages are something that should be able to change without one year's notice.
Thanks for the changes; the EDD looks good. Since no one else has commented in the last 12 days, I suggest we wait one more day to see if there are any more comments before we end the discussion and move forward.
I have updated the design: if server_dyn_res_alarm is 0, the scheduler will not time out the scripts. This is effectively the same as the previous behavior.
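A minimal sketch of the zero-means-disabled rule, assuming a Python-style runner rather than the real scheduler code; `effective_timeout` and `run_script` are hypothetical names. The point it illustrates is that 0 has to be mapped to "no timeout" explicitly, since passing 0 straight through as a timeout would normally mean "expire immediately".

```python
import subprocess
import sys


def effective_timeout(server_dyn_res_alarm):
    """Translate the alarm setting into a subprocess timeout.

    Assumption for this sketch: 0 disables the alarm, so the caller
    waits indefinitely, as the scheduler did before this change.
    """
    return None if server_dyn_res_alarm == 0 else server_dyn_res_alarm


def run_script(argv, server_dyn_res_alarm):
    """Run one script under the effective timeout and return its output."""
    done = subprocess.run(
        argv,
        capture_output=True,
        text=True,
        timeout=effective_timeout(server_dyn_res_alarm),
    )
    return done.stdout.strip()


if __name__ == "__main__":
    quick = [sys.executable, "-c", "print(3)"]
    print(run_script(quick, 0))   # alarm disabled: waits as long as needed
    print(run_script(quick, 30))  # alarm of 30 seconds
```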