We are running into an issue with managing available memory on our compute nodes. In our qmgr config, we set resources_available.mem a fixed amount below the total physical memory on the machine so that jobs cannot starve the OS. For example, if a node has 64gb of physical memory, we might do something like:
set node node0001 resources_available.mem = 54gb
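To make the arithmetic behind that setting explicit, here is a minimal Python sketch that subtracts a fixed reserve from a node's physical memory and formats the matching qmgr command. The helper name `reserved_mem_command` and the 10 GB reserve are assumptions for illustration, inferred from the 64gb/54gb example above; nothing here is part of PBS itself.

```python
def reserved_mem_command(node: str, physical_gb: int, reserve_gb: int) -> str:
    """Build the qmgr command that sets resources_available.mem on a node.

    Hypothetical helper: the name and the fixed-reserve policy are
    assumptions for illustration, not a PBS API.
    """
    if reserve_gb < 0 or reserve_gb >= physical_gb:
        raise ValueError("reserve must be non-negative and smaller than physical memory")
    available_gb = physical_gb - reserve_gb
    return f"set node {node} resources_available.mem = {available_gb}gb"


# A 64 GB node with 10 GB held back for the OS.
print(reserved_mem_command("node0001", 64, 10))
# prints: set node node0001 resources_available.mem = 54gb
```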
The problem is that whenever the PBS MoM process restarts, whether from a reboot or just a service restart, a MoM hook resets the available memory to the total physical memory. From the server logs:
06/18/2020 09:44:53;0100;Server@bright01;Node;node0001.thunder.ccast;Updated vnode node0001's resource resources_available.mem=65336320kb per mom hook request
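For anyone who wants to check how often this happens across nodes, the following is a rough Python sketch that pulls the vnode, resource, and value out of a server log line like the one above. It assumes the semicolon-separated layout and message wording shown in that line; the function name `parse_hook_update` is made up for this example.

```python
import re

# Matches the message text seen in the log line above; other PBS versions
# may word it differently, so treat this pattern as an assumption.
PATTERN = re.compile(
    r"Updated vnode (?P<vnode>\S+)['’]s resource (?P<resource>[\w.]+)=(?P<kb>\d+)kb "
    r"per mom hook request"
)


def parse_hook_update(line: str):
    """Return details of a mom-hook resource update, or None if the line does not match."""
    # Assumed layout: timestamp;code;server;object type;object name;message
    fields = line.rstrip("\n").split(";", 5)
    if len(fields) != 6:
        return None
    match = PATTERN.search(fields[5])
    if match is None:
        return None
    kb = int(match.group("kb"))
    return {
        "timestamp": fields[0],
        "vnode": match.group("vnode"),
        "resource": match.group("resource"),
        "kb": kb,
        "gib": round(kb / 1024**2, 2),  # kb -> GiB, for comparison with the qmgr value
    }


line = (
    "06/18/2020 09:44:53;0100;Server@bright01;Node;node0001.thunder.ccast;"
    "Updated vnode node0001's resource resources_available.mem=65336320kb per mom hook request"
)
print(parse_hook_update(line))
```

For the line above, the reported value works out to roughly 62.3 GiB, which is the full memory the node reports rather than the 54gb that was configured.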
Can we somehow change this behavior to prevent our memory values from being overwritten?
Thank you for the tips. How would I go about determining which hook is changing the resources, and then how do I change it? We’ve created and deployed custom hooks, so I know the workflow there. But this seems to be caused by a hook that’s part of the default PBS installation. Can you point me to somewhere in the documentation that describes how to do what you’re suggesting?
The MoM logs should show the hooks being run at a higher log level. I’m not sure of the exact level off the top of my head, but for debugging 0xffff always works. (8.10 in the Hooks guide, Error Reporting and Logging)
Do you have the cgroups hook enabled? If it’s set to create vnodes, it will use all the available memory on the machine. You can change this in the cgroups hook’s config file with the keys ‘reserve_amount’ and ‘reserve_percent’. (15.4 in the Admin guide, Configuring Cgroups)
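As a sketch of what that edit could look like, the Python below updates the two reserve keys in a cgroups hook config that has already been exported to JSON. It assumes the keys live under `cgroup` then `memory`, and that sizes are written like "10GB"; check both against your own exported file and the Admin guide, since the layout can differ between PBS versions. The function name `set_memory_reserve` and the sample values are made up for illustration.

```python
import json


def set_memory_reserve(config: dict, reserve_amount: str, reserve_percent: int = 0) -> dict:
    """Return a copy of the hook config with the memory reserve keys updated.

    Assumption: the keys sit at config["cgroup"]["memory"]; verify this
    against your exported config file before relying on it.
    """
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    memory = updated.setdefault("cgroup", {}).setdefault("memory", {})
    memory["reserve_amount"] = reserve_amount
    memory["reserve_percent"] = reserve_percent
    return updated


# Stand-in for the relevant part of an exported config (illustrative values only).
sample = {"cgroup": {"memory": {"enabled": True, "reserve_amount": "64MB", "reserve_percent": 0}}}
new_config = set_memory_reserve(sample, "10GB")
print(json.dumps(new_config, indent=2))
```

The usual workflow would be to export the hook's config with qmgr, apply a change like this (or edit the file by hand), and import it again; the section of the Admin guide cited above describes the exact commands.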