Cannot make cgroups work under EL9

Hello,

The cgroups hook doesn’t work for me under EL9 (specifically, AlmaLinux 9). I enabled cgroups v1 on the nodes (the kernel boots with “systemd.unified_cgroup_hierarchy=0”), and I ported libcgroup and libcgroup-tools from EL8. And, of course, I enabled the hook with “set hook pbs_cgroups enabled = true” etc.
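A quick way to confirm which cgroup layout a node actually booted with is to look at what is mounted under /sys/fs/cgroup. The following is only a minimal sketch: the function name cgroup_mode is made up for illustration, and it assumes the usual mount point. It relies on the fact that a unified (v2) root contains a cgroup.controllers file, while a v1 layout has one subdirectory per controller (cpuset among them).

```python
import os


def cgroup_mode(root="/sys/fs/cgroup"):
    """Return 'v2', 'v1', or 'unknown' for the hierarchy mounted at root.

    Hypothetical helper for illustration; not part of OpenPBS.
    """
    # A unified (v2) hierarchy exposes cgroup.controllers at its root.
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return "v2"
    # A v1 layout mounts each controller in its own subdirectory.
    if os.path.isdir(os.path.join(root, "cpuset")):
        return "v1"
    return "unknown"


if __name__ == "__main__":
    print(cgroup_mode())
```

If this reports v2 on a node that was supposed to boot with “systemd.unified_cgroup_hierarchy=0”, the kernel command line is the first thing to recheck.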

Yet the jobs are not confined to the requested cpusets. E.g., when requesting “ncpus=1”, “stress --cpu 8” runs on all 8 cores of a node at 100% each, while the same job under EL7 is correctly bound to a single core, with each thread getting 12.5%.
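Besides watching CPU load, the binding can be checked directly from inside a job by reading the Cpus_allowed_list field of /proc/self/status, which reflects the cpuset the process is confined to. This is a minimal sketch to run as (or from) the job script; the helper names parse_cpu_list and allowed_cpus are made up for illustration.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-2,5' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def allowed_cpus(status_path="/proc/self/status"):
    """Return the CPUs this process may run on, per the given status file."""
    with open(status_path) as fh:
        for line in fh:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return []


if __name__ == "__main__":
    cpus = allowed_cpus()
    print(f"{len(cpus)} CPU(s) allowed: {cpus}")
```

For a job submitted with “ncpus=1” on a node where the hook works, one would expect a single CPU to be listed; seeing all 8 means the job was never placed in a restricted cpuset.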

Any idea what might be wrong? OpenPBS 23.06.06.

OK, looking more carefully in the logs, I see warnings about PBS_MOM_NODE_NAME not being defined, and hence job resources being unavailable. Why? I’ve never defined PBS_MOM_NODE_NAME, since all vnodes are trivial host nodes themselves. Is it something new since version 20.0?

Setting PBS_MOM_NODE_NAME to what hostname returns fixes the problem. But shouldn’t the hostname be a reasonable fallback on its own?

That is because the natural node name used on the server to create the node does not seem to match the output of ‘hostname’. Hooks need to know the name that the server gives to the natural node, and without the variable they assume that it is the output of ‘hostname’.
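The lookup described here can be pictured roughly as follows. This is an illustrative sketch only, not the hook's actual code: the function names are made up, and it assumes PBS_MOM_NODE_NAME would be set in /etc/pbs.conf as a plain KEY=VALUE line.

```python
import os
import socket


def read_pbs_conf(path="/etc/pbs.conf"):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    conf = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            conf[key.strip()] = value.strip()
    return conf


def expected_node_name(conf, hostname=None):
    """Prefer the explicit setting; otherwise fall back to the hostname."""
    name = conf.get("PBS_MOM_NODE_NAME")
    if name:
        return name
    return hostname if hostname is not None else socket.gethostname()


if __name__ == "__main__":
    path = "/etc/pbs.conf"
    conf = read_pbs_conf(path) if os.path.exists(path) else {}
    print(expected_node_name(conf))
```

Whatever name comes out has to be identical to the natural node name known to the server, otherwise the hook cannot find the node's resources.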

Well, I don’t believe that is the case. The output of pbsnodes looks like this:

wn001
     Mom = wn001.x.y.z.t
     Port = 15002
     pbs_version = 23.06.06
     ...

and on the node, hostname and hostname -f return wn001 and wn001.x.y.z.t, respectively (obviously, I’ve replaced the real domain name with “x.y.z.t”).
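The same comparison can be made mechanically. A minimal sketch, assuming pbsnodes output in the layout shown above (node name at the start of a line, indented “key = value” attributes beneath it); parse_pbsnodes is a made-up helper, not a PBS API.

```python
def parse_pbsnodes(text):
    """Map each node name to a dict of its attributes."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


if __name__ == "__main__":
    import socket
    import sys

    # Usage: pbsnodes -a | python3 this_script.py
    nodes = parse_pbsnodes(sys.stdin.read())
    host = socket.gethostname()
    print(f"hostname {host!r} is a node name on the server: {host in nodes}")
```

Run on the execution node, this shows whether the string returned by hostname is literally one of the node names the server reports, as opposed to only matching the Mom attribute.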

Is there anything DNS-specific I need to check? Could the many-level domain name (four parts) be the culprit?

Hello,
I agree with this.