Reconfirmation of running reservations

bhroam · November 22, 2019, 11:27pm

Currently if a degraded or in-conflict reservation starts running, no attempts to reconfirm it will be made. The reservation will remain short of nodes for its lifetime.

I’m making changes to replace nodes of a running reservation.

See the following design document, which describes the changes being made to how degraded or in-conflict reservations will be reconfirmed:

https://pbspro.atlassian.net/wiki/spaces/PD/pages/1401815043/Reconfirming+degraded+reservations+that+are+running

Bhroam

scc · November 25, 2019, 7:15pm

Thanks @bhroam!

One very minor comment is that “Changes to how degraded reservations” seems to be an incomplete header.

In the “New workflow of a degraded or in-conflict reservation” section, is it left purposefully vague as to HOW “PBS will determine the first time the first reservation reconfirmation will be attempted…”? (The same goes for how resv_retry will be set, and to what, in item 3 of the same section.)

Will there be no way to control how long PBS will wait before attempting to reconfirm a reservation?

bhroam · November 25, 2019, 7:53pm

@scc Thanks for reviewing the document. I originally had more information about how the reservations would be reconfirmed. I had a short conversation with @billnitzberg and he suggested I keep it vague and let PBS decide. If we need to expose controls we can. If you feel strongly, I can go back to my original design where the admin explicitly controls the duration between reconfirmation attempts. As a side note, it will make testing easier since we can set the duration short.

Bhroam

bhroam · January 15, 2020, 1:21am

I’m closing in on finishing my implementation of the feature. I had to change the design a little. I found out that it is impossible to reconfirm in-conflict reservations after they start running. When reconfirming a running reservation, you keep the nodes that it has and replace the nodes that are down. The problem is that an in-conflict reservation’s resv_nodes is set to only the nodes it still has; the conflicted nodes have already been removed. It’s too hard to map the select statement to the nodes that are left.
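To illustrate the "keep the nodes it has, replace the nodes that are down" idea, here is a minimal sketch. It is not the PBS implementation: the function name and arguments are hypothetical, and it treats all nodes as interchangeable, which sidesteps the select-statement mapping problem described above.

```python
def reconfirm_running(resv_nodes, down_nodes, free_nodes):
    """Keep the healthy nodes and swap each down node for a free one.

    Hypothetical helper: nodes are plain names and any free node is
    assumed to be an acceptable replacement (no select matching).
    Returns the new node list, or None if there are too few replacements.
    """
    down = set(down_nodes)
    kept = [n for n in resv_nodes if n not in down]
    needed = len(resv_nodes) - len(kept)
    # Candidates must be up and not already held by the reservation.
    candidates = [n for n in free_nodes if n not in down and n not in kept]
    if len(candidates) < needed:
        return None  # reservation stays degraded; retry later
    return kept + candidates[:needed]
```

For example, a reservation on n1, n2, n3 with n2 down and n4, n5 free would end up on n1, n3, n4 in this simplified model.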

Also, to help with testing, I changed the attributes a little. Instead of having reserve_retry_init, which sets when the first reconfirmation attempt is made, I have reserve_retry_time. This attribute is the time between attempts. reserve_retry_init is now deprecated. The first attempt will be made reserve_retry_time seconds after the reservation is first degraded.
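As a rough sketch of the timing this implies, assuming each attempt fails instantly and the next one is scheduled exactly reserve_retry_time seconds later (the function name and the epoch-second inputs are illustrative, not part of PBS):

```python
def retry_schedule(degraded_at, reserve_retry_time, attempts):
    """Return the times (in seconds) of the first `attempts` reconfirmation tries.

    Assumption: the first try is reserve_retry_time seconds after the
    reservation is first degraded, and later tries are evenly spaced.
    """
    return [degraded_at + reserve_retry_time * k for k in range(1, attempts + 1)]
```

With a reservation degraded at t=1000 and reserve_retry_time set to 600, the first three attempts would fall at 1600, 2200, and 2800. Setting the value low is what makes the feature easy to test.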

Bhroam

arungrover · January 21, 2020, 6:59pm

Design looks good to me.

agurban · February 21, 2020, 7:49pm

I’d prefer it if we could change “reserve_retry_time” to “reserve_retry_interval”. It’s really an interval, and it’s not too late to make its purpose easier to understand.

arungrover · June 3, 2020, 7:21pm

Hi All,

I am fixing a reservation bug in PR https://github.com/openpbs/openpbs/pull/1795 that will make the PBS server log a ‘Y’ record in the accounting logs every time a degraded reservation is reconfirmed.
Since this is a change in external behavior, I have updated the “external interface” section of the document to reflect this change.
Please have a look and provide comments.

Thanks,
Arun

agrawalravi90 · June 4, 2020, 7:57pm

Hi Arun,

Just one comment:
<resvID> requestor=Scheduler@ start=<(new/original) start time> end=<(new/original) end time> nodes=()

Shouldn’t the resvID value be preceded by a key, e.g. “reserve_ID=<resvId>”?

arungrover · June 4, 2020, 8:09pm

Thanks for reviewing it, Ravi. I don’t think we add reserve_ID for any reservation-related accounting logs. It has always been in “<Log-type>;<object identifier>;<attributes>” format.

There was a semicolon missing in the line though. I’ve corrected that.
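For anyone consuming these records, here is a minimal parsing sketch for the “<Log-type>;<object identifier>;<attributes>” shape. It is only an illustration: the function name is made up, it assumes attributes are space-separated key=value pairs with no spaces inside values, and it ignores any other fields a real accounting log line may carry.

```python
def parse_record(line):
    """Split '<Log-type>;<object identifier>;<attributes>' into its parts.

    Assumption: attributes are space-separated key=value tokens, and only
    the first '=' in a token separates the key from the value.
    """
    log_type, object_id, attrs = line.split(";", 2)
    fields = {}
    for token in attrs.split():
        key, _, value = token.partition("=")
        fields[key] = value
    return log_type, object_id, fields
```

Because the reservation ID is the second positional field, no reserve_ID key is needed to recover it.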

agrawalravi90 · June 4, 2020, 8:09pm

Great, looks good to me then, thanks for correcting it.
