PP-305: If server_dyn_res script does not return, Scheduler hangs

varunsonkar · May 17, 2017, 9:59am

Hi All,

As of today, if a server_dyn_res program/script hangs or does not return, the scheduler keeps waiting for it to finish executing.
The following design document describes the proposed solution to this hang issue:
Design Document

Please review the proposed design and provide your comments/feedback.

Regards,
Varun

scc · May 17, 2017, 1:57pm

Hi Varun, a few comments:

1. Currently the EDD is written to sound like there is only ever 1 server_dyn_res script (“THE server_dyn_res” is written twice, but I think it should be “A server_dyn_res”).
2. Related to 1), the EDD should be explicit about whether this timeout applies to ALL server_dyn_res scripts together, or to EACH server_dyn_res script individually.
3. The EDD should be explicit about how the resource tied to a timed-out script run is treated in the cycle in which it timed out. Does the cycle continue normally, with the value of that resource assumed to be 0?
4. Related to 3), assuming the timed-out resource value is treated as 0 for that cycle, the log message in interface 2 should be explicit about this. That is not necessary, though, if in this scenario the log will contain something like these current messages in addition to the new message in interface 2:

05/17/2017 09:54:07;0080;pbs_sched;Svr;server_dyn_res;Error piping to program /bin/get_foo.
05/17/2017 09:54:07;0100;pbs_sched;Svr;server_dyn_res;/bin/get_foo = 0

If those messages will be printed upon timeout in addition to the new one in interface 2, I think the EDD should be explicit about it.

I am not sure I like the name in interface 1. I like the simpler “server_dyn_res_timeout” better, or maybe “server_dyn_res_alarm”. Did you insert the “prog” to try to make it clearer that the timeout applies to each individual script/program? I think if the EDD and docs are explicit about this then the simpler attribute name is better.

Thanks!

jon · May 17, 2017, 10:50pm

I think that it should apply to each server_dyn_res script individually, to match the behavior of the alarm in hooks.

This is a good question. Sites use server_dyn_res in various ways: some use it to get license counts, others use it to alter jobs, etc. I think that we should assume zero and continue. If a site wants different behavior, they will need to add a timeout in their script.

I prefer server_dyn_res_alarm
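
For sites that want something other than "assume zero" on a hang, here is a minimal sketch of a script that enforces its own, shorter timeout and prints a site-chosen fallback. It assumes the script reports the resource value on stdout; the helper name `query_with_timeout`, the backend command path, and the fallback value are all hypothetical and not part of the EDD.

```python
import subprocess
import sys


def query_with_timeout(argv, timeout_sec, fallback):
    """Run argv and return its stripped stdout, or `fallback` if it
    times out, cannot be started, or exits non-zero."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_sec
        )
    except (subprocess.TimeoutExpired, OSError):
        # Hung or missing backend: use the site-chosen fallback.
        return fallback
    if result.returncode != 0:
        return fallback
    return result.stdout.strip()


def main():
    # Hypothetical backend command and fallback value; substitute your own.
    value = query_with_timeout(["/opt/site/bin/query_licenses"], 10, "4")
    sys.stdout.write(value + "\n")


if __name__ == "__main__":
    main()
```

The inner timeout would need to be shorter than the scheduler-side alarm, otherwise the scheduler's own timeout fires first and the fallback is never printed.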

jon · May 17, 2017, 10:51pm

In interface 1, I believe that we should set the default value to 30 sec to match the default we have in hooks.

varunsonkar · May 18, 2017, 9:56am

Hi @jon and @scc,
Thanks for the comments.

Yes, the timeout applies to each server_dyn_res script individually. I have added this point to the EDD.

Thanks for raising this point; I had missed it. I also agree that we should assume the value to be zero and continue.
I have added this point to the EDD as well.

Modified the name in the EDD to "server_dyn_res_alarm".

varunsonkar · May 18, 2017, 10:04am

I have modified the EDD with the following messages:
… …;0080;pbs_sched;Svr;server_dyn_res;program /bin/get_foo timed out
… …;0100;pbs_sched;Svr;server_dyn_res;/bin/get_foo = 0

I have not added the “Error piping to program” message because I think this may not be a piping error, just a hang. We have a specific condition for piping errors in the code, which will remain intact. It is quite possible that the program is piping correctly but times out because of delays, slowness, or a hang.
So I prefer keeping only the “timed out” message.
Please review the updated EDD.
Thanks,
Varun
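
The behavior discussed so far (a timeout applied to each program separately, a "timed out" message, then the value treated as 0 for the cycle) can be sketched as follows. This is only an illustration in Python, not the scheduler's actual implementation; `run_dyn_res`, its parameters, and the stderr logger are assumed names, and only the two message texts come from the EDD excerpt above.

```python
import subprocess
import sys


def _log_stderr(message):
    # Stand-in for the scheduler log; an assumption made for this sketch.
    print(message, file=sys.stderr)


def run_dyn_res(name, argv, alarm_sec, log=_log_stderr):
    """Run one server_dyn_res program and return the value to use this cycle."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=alarm_sec
        )
        value = proc.stdout.strip()
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising.
        log(f"program {name} timed out")
        value = "0"  # assume zero and continue the cycle
    log(f"{name} = {value}")
    return value
```

Calling `run_dyn_res` once per configured program gives each one its own alarm, so one slow script does not consume the time budget of the others.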

varunsonkar · May 18, 2017, 10:05am

Modified the default value to 30 sec. Please review the updated EDD.
Thanks,
Varun

jon · May 18, 2017, 2:59pm

The changes look fine. One suggestion: let's log interface 3 at the default logging level.

bhroam · May 18, 2017, 5:56pm

@varunsonkar unless the two log messages (interfaces 2/3) are required for automated testing, I would suggest making them Unstable. Log messages are something that should be able to change without one year's notice.

Bhroam

varunsonkar · May 19, 2017, 4:48am

Modified the EDD. Please review.
Thanks,
Varun

varunsonkar · May 19, 2017, 4:51am

Hi @bhroam,
Thanks for the input. I have updated the EDD as per your suggestion and marked interfaces 2 and 3 as Unstable.
Regards,
Varun

jon · May 31, 2017, 3:59am

Thanks for the changes; the EDD looks good. Since no one else has commented in the last 12 days, I suggest we wait one more day to see if there are any more comments before we end the discussion and move forward.

vstumpf · January 15, 2020, 2:23am

I have updated the design: if server_dyn_res_alarm is 0, the scheduler will not time out the scripts. This is effectively the same as the previous behavior.
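
As a small illustration of the "0 disables the timeout" rule, the alarm value could be translated into a timeout argument like this. The helper name `effective_timeout` is hypothetical, and using `None` for "wait indefinitely" follows Python's `subprocess` convention rather than anything stated in the design.

```python
def effective_timeout(server_dyn_res_alarm):
    """Translate the alarm setting into a subprocess-style timeout.

    0 means "never time out", which subprocess expresses as None;
    any other value is used as the number of seconds to wait.
    """
    return None if server_dyn_res_alarm == 0 else server_dyn_res_alarm
```

The result can be passed directly as `timeout=` to `subprocess.run`, so an alarm of 0 reproduces the old wait-forever behavior.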
