I am merely reminding people to not be surprised that only one ALPS release reservation request is made if they make either of the new tunables large enough so the potential combined total is longer than alps_release_timeout. It is possible to set alps_release_interval_usec to be larger than alps_release_timeout.
But I doubt any PBS admin would want to wait longer than 10 minutes between ALPS reservation requests.
Adding to what Lisa said, I think alps_release_timeout cannot be set in microseconds. Even if we increase alps_release_interval_usec to a value larger than alps_release_timeout (converted to seconds), PBS will still try to cancel the reservation at least twice before finding out that alps_release_timeout has expired.
Thanks for writing this up, Lisa. The design doc explains the new feature very nicely.
The design mentions that the interval and jitter can both be set to 1 microsecond or more. I have a few questions related to that:
If alps_release_jitter_usec is set to 1 microsecond, will the random value be a float between 0 and 1 microsecond? Or is the maximum precision 1 microsecond, making it either 0 or 1?
If alps_release_jitter_usec and alps_release_interval_usec are both set to 1 microsecond, the release requests are supposed to be sent every 2 microseconds or less. I’m a bit concerned that this might not be enough time for the release request itself to be sent out. If sending each release request takes more than 2 microseconds, does that make the timing of the polling undefined? Do we care?
One more thing, just curious, if ALPS doesn’t respond for whatever reason, is there a timeout that will be used to stop trying and bail?
I like the additional flexibility, and I have two suggestions:
Express the time in seconds (e.g., SSS.ssssss) or as a PBS Duration type (versus microseconds) – Consistency is an important aspect of good design, and PBS Pro (almost always) uses seconds and durations for time. Seconds (or a PBS duration) would both be more readable (as the default values seem to be in seconds (0.5 and 4)), and also more consistent with the other time-based values in PBS Pro. So, the design would change to something like:
Explicitly leave the resolution of time as implementation dependent – so rather than force microseconds, say something like:
The minimum time interval (resolution) is implementation dependent and may be different for different versions of ALPS and PBS Pro. In the first implementation of this capability, the durations were settable in microseconds. The supplied value may be adjusted (rounded or truncated) based on the available resolution.
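To illustrate the "rounded or truncated" wording, here is a minimal Python sketch of how a value given in seconds might be adjusted to whatever resolution is available. The function name `seconds_to_ticks`, the round-half-up policy, and the default microsecond resolution are assumptions for illustration only, not what PBS Pro actually does.

```python
from decimal import Decimal, ROUND_HALF_UP


def seconds_to_ticks(value: str, resolution: str = "0.000001") -> int:
    """Convert a duration in seconds (e.g. "0.5") to whole resolution ticks.

    Hypothetical helper: the rounding policy (round half up) is an
    assumption; an implementation could just as well truncate.
    """
    ticks = Decimal(value) / Decimal(resolution)
    return int(ticks.to_integral_value(rounding=ROUND_HALF_UP))


if __name__ == "__main__":
    print(seconds_to_ticks("0.5"))           # microsecond ticks
    print(seconds_to_ticks("0.5", "0.001"))  # millisecond ticks
```

With microsecond resolution, a supplied value below half a microsecond would round down to zero, which is the kind of adjustment the wording above is meant to allow.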
Actually, it’s the amount of time that PBS will sleep (usleep) in between release requests. It is not the exact time at which the requests will be sent. As you note, the whole process of making the calls may take longer (or the system may be slow, etc.).
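A rough Python sketch of that behavior, in case it helps: the tunable is a sleep between requests, so the time spent sending each request is added on top of it. Everything here is hypothetical (the names `release_reservation` and `send_request`, drawing the jitter as a float, and checking the timeout after each sleep); it is not the PBS code.

```python
import random
import time


def release_reservation(send_request, wait_usec, jitter_usec, timeout_sec,
                        sleep=time.sleep, clock=time.monotonic,
                        rng=random.uniform):
    """Send release requests until one is confirmed or the timeout expires.

    send_request is a hypothetical callable returning True once ALPS
    confirms the release. Returns True on confirmation, False on timeout.
    """
    deadline = clock() + timeout_sec
    while True:
        if send_request():
            return True
        # Sleep for the fixed wait plus a random jitter. This is the pause
        # between requests, not the period at which requests go out.
        pause_usec = wait_usec + rng(0, jitter_usec)
        sleep(pause_usec / 1_000_000)
        # Assumption: the timeout is checked after the sleep.
        if clock() >= deadline:
            return False
```

With the check placed after the sleep, a wait longer than the timeout results in a single request, which is why where that check sits matters.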
I need to find a way to make this clearer in my design. Any suggestions?
Ah, I see, thanks for the clarification. I think some of the same words will help clarify this in the EDD. The word ‘interval’ made me think that a release request will be sent every X microseconds. I think ‘wait time’ instead of ‘interval’ might help clarify this, but that's just my personal opinion.
@billnitzberg Thanks for your suggestion, I like the idea. I incorporated it into my design (some of it verbatim). @agrawalravi90 as per your suggestion, I have changed the tunable name to alps_release_wait_time.
Passing along some feedback I got from Larry, a developer from Cray. For relatively small systems (150 compute nodes), Larry recommends 4/10 sec for the interval and 12/100 sec for the jitter.
Unfortunately we don’t have larger systems to try this on to find the right sweet spot.
There is a small overhead to the cancel request on the ALPS side. The goal is to find a balance: not pinging too often, but also not keeping a job waiting to find out that the job resv is cancelled.
Larry did not think we should get rid of the jitter capability. However he does strongly recommend we change the default jitter from 4sec to <1sec (at least).