PBS cgroups and Numa Nodes issue

rchaudhary · May 24, 2022, 1:05pm

Hi

I would like to ask for advice on the following issue:

The scenario is a job submitted with a requested amount of memory (e.g. 50gb).
The job is then placed in a cpuset (NUMA node) that has less than 50gb available. For information, the overall available memory on the server is more than 50gb.

The job is then killed by the OOM killer with the constraint CONSTRAINT_CPUSET.

My question is: does PBS check whether the cpuset that a job is assigned to has enough available memory? If not, what can we do to correct the logic?

We are using pbs_version = 19.1.3

Note: I checked our configuration and compared it with the document below, and it is the same:
PP-325: Support Cgroups - Project Documentation - Confluence (atlassian.net)
Jobs killing reason - CONSTRAINT_CPUSET

Regards,
Ritika.

alexis.cousein · May 31, 2022, 6:34am

Disable mem_fences in the cgroup hook config file. Even with vnode_per_numa_node turned on, unless you control your workload so that jobs straddling vnodes do not share vnodes, there is no mechanism for the kernel to honour the split when the scheduler allocates e.g. 20gb on one socket and 30gb on another. That can make jobs fenced into a single node fail if the actual usage is different.
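
As a rough illustration of that change, here is a minimal Python sketch that flips the setting in an exported copy of the hook configuration. It assumes the config is JSON and that the setting lives at cgroup -> cpuset -> mem_fences; the exact layout may differ in your PBS version, so check your own exported file first. The helper name disable_mem_fences and the sample text are hypothetical.

```python
import json


def disable_mem_fences(config_text: str) -> str:
    """Return the hook config with cpuset memory fencing turned off.

    Assumption: the config is JSON and the key path is
    cgroup -> cpuset -> mem_fences. Verify against your exported file.
    """
    config = json.loads(config_text)
    cpuset = config.setdefault("cgroup", {}).setdefault("cpuset", {})
    cpuset["mem_fences"] = False
    return json.dumps(config, indent=4)


# Hypothetical, heavily trimmed sample; a real config has many more keys.
sample = '{"cgroup": {"cpuset": {"enabled": true, "mem_fences": true}}}'
updated = disable_mem_fences(sample)
print(updated)

# Typical round trip on the server host (run as a PBS manager):
#   qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json
#   ... edit pbs_cgroups.json, e.g. with the function above ...
#   qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.json"
```

All other keys in the file are left untouched, so the rest of the cpuset section keeps its current values.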
