Remove support for cpuset MoM

There is a proposal to remove the special cpuset MoM (pbs_mom.cpuset) code from the code line.

Please see the proposal at:

Design looks good. Thanks.

Perhaps we should point out that the “cpuset” code also contains other optimisations that we need to address. In particular, there is an optimisation to avoid polling the numerous kernel threads on very large single system images (with thousands of CPU cores).

I am not advocating keeping the existing code – there are probably simpler optimisations that are as effective, and the current “cpuset” code optimisations have also introduced bugs (in particular, they broke tm_attach, something not noticed immediately, since on these machines people tend not to use tm_attach but instead run jobs where everything is in a single task with processes that are descendants of the job script shell).

We should probably link the design document for that into this one.

That is being discussed here:

and the design document is this:

Good point @alexis.cousein. I have added a link to the design.

Might we want to set vnode_per_numa_node to true by default? That might save the admins some work if it makes sense to have it enabled by default.

I noticed that when using the cpuset MoM the sharing attribute of vnodes is default_excl. That attribute is default_shared when using the standard MoM with the cgroups hook enabled and vnode_per_numa_node=true. What sharing attribute should we use?

Since this would be set for all platforms, and not just former cpuset ones, is setting vnode_per_numa_node to true what we want generically for all platforms? I’m hoping someone with more experience with cgroups will know the answer.
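For context, vnode_per_numa_node is a setting in the cgroups hook’s JSON configuration file. A minimal fragment (key names as discussed in this thread; any other keys and the exact file layout are elided here) might look roughly like:

```json
{
    "vnode_per_numa_node": true,
    "use_hyperthreads": true
}
```

Whatever defaults we pick would ship in that file, so admins can still flip them per site.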

Would like to hear @alexis.cousein’s opinion on it.

@mkaro Enabling vnode_per_numa_node is a very big hammer.

The scheduler behaviour changes if there is even one multivnode host (the vnode ordering changes). I don’t think that is something we should enforce by default. Also, if people usually use whole hosts in single chunk requests, it makes the scheduler slower, since it has to split chunks across different vnodes.

Enabling the flag thus increases the differences between a configuration with no cgroup hook running and a configuration where it is enabled. Mind you, I do know sites where socket-wide jobs are more efficient than host-wide ones, and then it does make sense.

On the other hand, we should definitely not set mem_fences to True on configurations where vnode_per_numa_node is False.

Setting mem_fences to True on configurations where vnode_per_numa_node is enabled also requires a well-behaved workload (people with large NUMA hosts have, over the years, managed to make the workload behave well even when memory fences are enabled).

It is fairly important on large NUMA machines to avoid degenerate (non-local) memory allocations in the presence of rogue jobs (although it has become less so since the advent of the memory cgroup controller, which can keep rogues in check) but there are pitfalls to enforcing memory fences.

With memory fences you cannot allow jobs to straddle vnodes and share them with other jobs unless you know they will allocate memory in a way that matches what the scheduler thinks. It is really important to avoid exec_vnodes like blah[0]:ncpus=14:mem=1GB+blah[1]:ncpus=2:mem=15GB: the job might actually allocate 14GB on blah[0], and once blah[0]’s memory is depleted the OOM killer can wake up and kill jobs confined to blah[0] alone, simply because one job allocated more on blah[0] and less on blah[1] than the scheduler assumed.
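To make the mismatch concrete, here is a small sketch (the helper name is mine, and it assumes a simplified exec_vnode grammar with exactly one vnode per parenthesised chunk) that splits an exec_vnode string into what the scheduler thinks each vnode is supplying:

```python
def per_vnode_resources(exec_vnode):
    """Parse a simplified exec_vnode string such as
    (blah[0]:ncpus=14:mem=1gb)+(blah[1]:ncpus=2:mem=15gb)
    into {vnode_name: {resource: value}}."""
    alloc = {}
    # Chunks are parenthesised and joined by '+'.
    for chunk in exec_vnode.strip("()").split(")+("):
        parts = chunk.split(":")
        res = alloc.setdefault(parts[0], {})
        for kv in parts[1:]:
            key, _, val = kv.partition("=")
            res[key] = val
    return alloc

sched_view = per_vnode_resources(
    "(blah[0]:ncpus=14:mem=1gb)+(blah[1]:ncpus=2:mem=15gb)")
# The scheduler assumes 15GB lands on blah[1]; with memory fences a
# job can nonetheless touch most of that memory on blah[0] and
# starve jobs that are fenced to blah[0] alone.
```

The point of the sketch is only that the scheduler’s per-vnode accounting and the job’s real first-touch allocation are two different things, and memory fences make that gap dangerous.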

Usually, to make sure it all works, jobs sharing nodes are confined to specific vnodes, and the different job types (1 CPU, socket-wide, blade-wide etc.) are not mixed, so that if a vnode is shared, no job straddles that vnode and another one in a weird way.

As for the sharing attribute, that is usually not that important, but I’d always default it to “shared” unless the site knows that they have jobs that always use a multiple of the host size.

A well-managed site will usually have a queuejob hook adding the proper placement directives anyway (often with parallel jobs spanning more than one node always using -lplace=excl, regardless of their size, because the performance impact of sharing one node with a small job extends to the other parts of the job, which may sit spinning for a laggard thread).
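As an illustration of the kind of placement logic such a queuejob hook might apply (the decide_place helper is hypothetical; the pbs-module wiring is shown only in the comment, since the hook body depends on each site’s policy):

```python
# In a real PBS queuejob hook this helper would be called from the
# event handler, roughly:
#   import pbs
#   e = pbs.event()
#   e.job.Resource_List["place"] = pbs.place(decide_place(nodect))

def decide_place(nodect):
    """Force exclusive placement for multi-node jobs so that no small
    job can share a host with part of a parallel job; let
    single-node jobs share their host."""
    return "excl" if nodect and nodect > 1 else "shared"
```

This mirrors the policy described above: -lplace=excl for anything spanning more than one node, sharing otherwise.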

I do know sites that prefer another default, but it’s possible to set that in a v2 config file. One caveat, though: the script that generates the v2 config file really needs to be called from MoM’s init.d script, because otherwise, if a machine boots with some blades missing, you may be left with a v2 “sharing” config file that defines vnodes that no longer have any physical hardware associated with them.
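For reference, such a generated v2 vnode definition file might look roughly like this (hostA and the vnode layout are purely illustrative; only the sharing lines matter here):

```
$configversion 2
hostA: resources_available.ncpus=0
hostA: resources_available.mem=0
hostA[0]: sharing=default_shared
hostA[0]: resources_available.ncpus=18
hostA[0]: resources_available.mem=96gb
hostA[1]: sharing=default_shared
hostA[1]: resources_available.ncpus=18
hostA[1]: resources_available.mem=96gb
```

If a blade is missing at boot and this file is stale, the hostA[1] entries would describe hardware that is no longer there, which is exactly the caveat above.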

As a result, it would actually be a good idea to add support for setting the sharing attribute for vnodes in the cgroup hook’s exechost_startup routine, since that routine is very much aware of the hardware on the host. That would be an alternative to having to tinker with the init.d file.
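A sketch of what that could look like (the sharing_for_vnodes helper and the per-host vnode list are hypothetical; the pbs-module wiring is shown only in the comment, since exechost_startup hooks run inside MoM with the pbs module available):

```python
# Inside the cgroups hook's exechost_startup handler this could be
# applied roughly as:
#   import pbs
#   e = pbs.event()
#   for name, value in sharing_for_vnodes(my_numa_vnodes).items():
#       e.vnode_list[name].sharing = value

def sharing_for_vnodes(vnode_names, default="default_shared"):
    """Return the sharing value to apply to each NUMA vnode the hook
    creates; a site could override the default here."""
    return {name: default for name in vnode_names}
```

Because the hook enumerates the NUMA nodes itself, it cannot create sharing settings for hardware that is absent at boot, unlike a pre-generated v2 file.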

To refer to historic practice: when PBS Pro was installed on a UV-class machine, it also did not start the cpuset MoM by default. We advised people that enabling it would be a good idea, but we left it to them to change the behaviour explicitly. If we find such an approach wise, we should definitely leave vnode_per_numa_node set to false by default.

We should be telling people who have large NUMA machines and machines with GPUs that they’d best change the default, though.

Will do. I will add this to my design.

@alexis.cousein and I had a chat offline about some of his suggestions. We agreed that, since we don’t want new changes to get lost within the removal of the cpuset code, it would be best not to make those changes part of this RFE.

I updated the design to also set use_hyperthreads to true.
And to kill -HUP the MoM
Please have a look.

The updates for memory look good.