Hey,
I am proposing to enhance pbs_ralter to allow it to modify a reservation’s select statement. Here is my design:
https://pbspro.atlassian.net/wiki/spaces/PD/pages/1664090170/Enhancing+pbs+ralter+for+-lselect
Bhroam
Is there a use case for submitting a job that will make the reservation release nodes? If so, do we need an environment variable like pbs_release_nodes has?
I thought about that. I don’t think so. It makes sense for a job because you are releasing nodes from the job you are in. A reservation is like an outer shell. The job shouldn’t really be releasing nodes from it (especially since it might be running on one of those nodes). I think this should be used by the owner of the reservation or the admin directly. Not by a job script running in the reservation.
Bhroam
Just because I had to ask: what about when it’s a job-specific reservation?
So what you guys are asking is that if a pbs_ralter happens from a job inside a reservation, you don’t need to provide the reservation being altered? It’ll just pick it up from the job itself? I’m not sure I like it. Keep in mind this would be a two-step process (sketched below): first the script would have to release the nodes from the job, and then release them from the reservation.
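For illustration, a rough sketch of that two-step flow (the job ID, reservation ID, and vnode name are made up, and the -lselect form of pbs_ralter is the feature proposed here):

pbs_release_nodes -j 123.server vnodeB
# then shrink the reservation itself by one chunk with the proposed -lselect
pbs_ralter -l select=2:model=ivy R456.server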
I think things would be cleaner if reservations are altered by the owner/admin directly.
I agree, I don’t think the reservation needs to self-adjust to the job it was created for. A job-specific reservation, once created, shouldn’t be any different from a normal reservation that is fully occupied by running jobs. If a user is releasing nodes from the job, then they can also release them from the reservation.
I think there was another use case for job-specific reservations where the job owner would allow other users to submit jobs into their reservation. If that happens then releasing nodes from jobs shouldn’t affect the reservation because it might be shared with multiple users.
Hi @bhroam,
A couple questions:
For an already-running reservation, is the Y accounting record written at the time the reservation changes (in a similar way to how B is written at the time the reservation starts)?
The Technical Details in the design left me wondering about the select statement options. Is the design saying that the select given to pbs_ralter must in some way be close/similar to the select given to pbs_rsub? To illustrate, can I do:
pbs_rsub -l select=3:model=ivy
# reservation created with nodes A, B, and C
pbs_ralter -l select=1:model=ivy:host=A+1:model=ivy:host=C
This lets me be picky when needed, and allows me to gin up some wrappers that drop specific node(s) (e.g. a wrapper to drop just the down/offline nodes).
Another angle, for future feature possibilities of pbs_ralter: in a different forum (not this site) there has been discussion of using something like “-W force” to let the user get something done and take responsibility for the (possibly disastrous) outcome. You may not have been in that discussion, but @scc was. The immediate context of that discussion was extending the duration of a running reservation, even if it has down/offline nodes. I’m not asking for that angle here and now, but it’d be nice if the implementation of this design minimized the difficulty of such a thing in the future.
-Greg
Yes, the Y record will be written at the time of the reconfirmation. This is how it works today with pbs_ralter.
I wasn’t planning on allowing that. The problem is the chunks are different. It would be difficult for the scheduler to map the chunks in the select to the chunks in exec_vnode. I guess we could make a special case for host/vnode maybe. I was planning on getting around this by doing the smart thing when selecting vnodes to drop: first pick unavailable vnodes, and then pick vnodes that are up but don’t have any running jobs on them from the reservation. This way you can just say “-lselect=2:model=ivy” and the scheduler will choose wisely (sketched below). Is there a use case to drop specific nodes as long as the scheduler does the right thing and doesn’t try to drop nodes in use?
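To make that concrete, a rough sketch against Greg’s example reservation on nodes A, B, and C (the reservation ID is made up):

pbs_rstat -f R123.server | grep -E 'resv_nodes|Resource_List.select'
# ask for one fewer chunk; the scheduler picks which node to drop, preferring
# unavailable nodes and nodes with no running jobs from the reservation
pbs_ralter -l select=2:model=ivy R123.server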
This is an interesting angle. This would require some changes to the scheduler’s node searching code because it will not consider nodes that are down or offline as eligible for a request. I guess if we made the node searching algorithm care less about avoiding bad nodes, it’d work. The reservation would immediately go degraded though. As for doing anything here that would make that hard? No, I’m currently just calling the node search code. That would require changes to the node search code to not care about certain things it cares about today.
I do agree it could possibly be disastrous. I’d definitely want that as an admin-only option, since you could extend one reservation into another.
Bhroam
OK, gotcha. I can be picky with your approach, I just need to run the extra step of offlining the node(s) I want to drop (then onlining after the pbs_ralter succeeds), roughly as sketched below. Use cases generally fall in the category of admin-knows-the-future scenarios: I want to get hardware work done on node 2 of a rack IRU, and that means nodes 11, 20, and 29 need to be pulled from the reservation (E-cell node layout). We charge users for all nodes assigned to a reservation, so we have some incentive to not just leave offlined/unusable nodes in the reservation.
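Such a wrapper might look roughly like this (hostnames, reservation ID, and chunk counts are made up, and it assumes the proposed -lselect support):

pbsnodes -o r1i0n11 r1i0n20 r1i0n29
# shrink from 32 to 29 chunks; the scheduler prefers dropping the offline nodes
pbs_ralter -l select=29:model=ivy R789.server
# clear the offline flag once the alter succeeds
pbsnodes -r r1i0n11 r1i0n20 r1i0n29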
-Greg
Another option is to set reserve_retry to something small like 60 and then offline the nodes. The scheduler will attempt to find other nodes for the reservation. This way you can do your maintenance while maintaining the same number of nodes in the reservation. If the reconfirmation attempt fails, then you can ralter the reservation to drop the nodes.
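Something like this, assuming the server attribute is named reserve_retry_time in your version (hostnames made up):

qmgr -c "set server reserve_retry_time = 60"
# offline the nodes; within about a minute the scheduler will try to reconfirm
# the now-degraded reservation on other nodes
pbsnodes -o r1i0n11 r1i0n20 r1i0n29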
Alternatively you can create a maintenance reservation for those nodes. This will cause the nodes to be stripped from the reservation and cause the reservation to go in-conflict. The downside here is that a running/in-conflict reservation can no longer be raltered.
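For reference, creating such a maintenance reservation would look roughly like this (assuming the admin-only --hosts form of pbs_rsub from the maintenance reservation RFE; times and hostnames are made up):

# maintenance window covering the nodes that need hardware work
pbs_rsub -R 0800 -E 1700 --hosts r1i0n11 r1i0n20 r1i0n29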
This is an unfortunate side effect of how maintenance reservations work. The in-conflict reservation’s resv_nodes attribute is modified and the overlapping nodes are stripped out. This makes it impossible to map a node in resv_nodes to a chunk in the select spec, and that mapping is required so we can make sure to keep the same nodes the reservation still has.
Hmm, this sent me looking through the code. When a maintenance reservation strips nodes from in-conflict reservations, nothing is written to the accounting log. This means there is no way to account properly for the fact that the reservation no longer has the nodes that were stripped from it. There will be a new ‘Y’ record if the reservation is reconfirmed, but that only happens if it is not running yet. Of course there is no record in the accounting log when a node goes down and the reservation is in the degraded state, so I guess this is similar. In-conflict was meant to be another degraded state. The maintenance reservation RFE went in before reconfirming running reservations, so I guess the lack of an accounting record makes sense.
Bhroam
We at Altair had a discussion about this feature today, and I wanted to post the main takeaways here for comment and posterity, as they will impact the feature and design under discussion:
If a site makes use of the resvsub hook to alter a submitted reservation, the reservation’s ultimate schedselect may be transformed from what the user initially submitted in such a way that crafting a select statement for use with pbs_ralter -lselect may not be possible without admin intervention. This is because normal users cannot see the reservation’s schedselect attribute via pbs_rstat -f. Similarly, if a reservation comes into being via the new create_resv_from_job mechanism, there are also opportunities for what the user submitted as their job’s select statement to be transformed into what ultimately becomes the reservation’s schedselect (namely, queue defaults/limits and a queuejob hook).
Additional functionality/capability to mitigate this is not being planned now, but if this turns out to be a real problem in the future we may have to consider something such as exposing the schedselect attribute to users as the basis for input to pbs_ralter -lselect (though invisible resources would then be a problem, among other possible complications) and/or a new hook event for reservation alteration.
At least in the initial implementation, the design as it stands right now will be amended to state that the requested select spec can only ask for fewer chunks of the same resources, and that this applies to a reservation regardless of its state, not just to running reservations.
The real-world use case for NOT allowing growing before a reservation starts running is as follows: the site has a mechanism to disallow reservations starting in the next 6 days. The reason is that their max job runtime (walltime) is 6 days and they do not want a top job’s start time to be pushed out by a reservation being confirmed.
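To make the shrink-only restriction concrete, a hedged example (reservation ID made up):

pbs_rsub -l select=4:ncpus=8:model=ivy
# original reservation R55.server created with 4 chunks
pbs_ralter -l select=2:ncpus=8:model=ivy R55.server
# allowed: fewer chunks of the same resources
pbs_ralter -l select=2:ncpus=16:model=ivy R55.server
# rejected: the chunk resources differ from the original
pbs_ralter -l select=6:ncpus=8:model=ivy R55.server
# rejected: this would grow the reservation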
@scc
Most of what you said above is correct except one thing: a hook cannot directly modify the schedselect. This is created internally by the server. Any modifications to the select by an rsub hook are made to the original select statement itself. If the select was modified, the user can see it in pbs_rstat -f before running pbs_ralter. I don’t think there is a reason we would need to expose the schedselect attribute to the user in the future.
I have modified the design document as @scc described. Please take a look.