In req_rescq.c:req_confirmresv, after logging the event, the server replies to the client saying their reservation is confirmed. In doing so, it calls reply_ack, which frees the batch_request. After replying, the server accesses the batch_request to write to the accounting log.
In order to fix this, I’ve moved the accounting log recording code above the reply, though this brings to light a problem that wasn’t caught before: if the accounting log recording fails (in this case, due to a malloc call failing), the server returns and doesn’t send a response to the client, which is incorrect.
My question is, should the failure of the writing to the accounting log prevent the reservation from being confirmed? Do we reject the reservation? Or are the accounting logs not necessary for confirmation?
@mkaro said this on the pr conversation:
IMHO, failure to write an accounting record should prevent the reservation from being confirmed because the accounting log would be incomplete.
So currently the only client in core PBSPro to pbs_confirmresv() is the scheduler. Rejecting a confirm has the following ramifications:
- The reservation will remain in the unconfirmed state. The scheduler will attempt to reconfirm it in the following cycle. This cycle might be up to 10 minutes away. Since the scheduler only considers reservations in that cycle, if the confirmation rejection is persistent, we’ll stall scheduling. This being said, the scheduler is highly reliant on the server. If it stops responding due to malloc() failures, scheduling will be stalled anyway.
- A pbs_rsub -I will likely timeout. If it was used with a negative number, the reservation gets deleted.
If a malloc() fails, the server is likely in a world of hurt. When one malloc() fails, more are likely to follow. This will probably hose up PBS. The server will be in an unstable state until memory is freed up.
So with all this said, I agree with @mkaro that we shouldn’t confirm the reservation. I have an additional reason though. Even if we do confirm the reservation, it might not be able to be saved to the database. Then we’re in a state where it’ll appear confirmed until the server crashes (which might happen with malloc() failing) and then it’ll come back up unconfirmed.
So in the case of requesting reservations, the server creates/alters the reservation if there are no inherent problems with the request. Only after the changes are confirmed does the server log the changes.
By the time the malloc call fails to create the accounting log buffer, the reservation is already confirmed on the server side. So in my opinion, even if the server fails to write to the accounting log, it should still ack the reservation.