Remove support for cpuset MoM

There is a proposal to remove the special cpuset MoM (pbs_mom.cpuset) code from the code line.

Please see the proposal at:

Design looks good. Thanks.

Perhaps we should point out that the “cpuset” code also contains other optimisations that we need to address. In particular, there is an optimisation to avoid polling the numerous kernel threads on very large single system images (with thousands of CPU cores).

I am not advocating keeping the existing code – there are probably simpler optimisations that are as effective, and the current “cpuset” code optimisations have also introduced bugs (in particular, they broke tm_attach, something not noticed immediately, since on these machines people tend not to use tm_attach but instead run jobs where everything is in a single task with processes that are descendants of the job script shell).

We should probably link the design document for that into this one.

That is being discussed here:

and the design document is this:

Good point @alexis.cousein. I have added a link to the design.

Might we want to set vnode_per_numa_node to true by default? That might save the admins some work if it makes sense to have it enabled by default.

I noticed that when using the cpuset MoM the sharing attribute of vnodes is default_excl. That attribute is default_shared when using the standard MoM with the cgroups hook enabled and vnode_per_numa_node=true. What sharing attribute should we use?

Since this would be set for all platforms, and not just former cpuset ones, is setting vnode_per_numa_node to true what we want generically for all platforms? I’m hoping someone with more experience with cgroups will know the answer.
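For context, vnode_per_numa_node is a setting in the cgroups hook’s JSON configuration file. A minimal fragment (key names as discussed in this thread; any other keys and the exact file layout are elided here) might look roughly like:

```json
{
    "vnode_per_numa_node": true,
    "use_hyperthreads": true
}
```

Whatever defaults we pick would ship in that file, so admins can still flip them per site.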

Would like to hear @alexis.cousein’s opinion on it.

@mkaro Enabling vnode_per_numa_node is a very big hammer.

The scheduler behaviour changes if there is even one multivnode host (the vnode ordering changes). I don’t think that is something we should enforce by default. Also, if people usually use whole hosts in single chunk requests, it makes the scheduler slower, since it has to split chunks across different vnodes.

Enabling the flag thus increases the differences between a configuration with no cgroup hook running and a configuration where it is enabled. Mind you, I do know sites where socket-wide jobs are more efficient than host-wide ones, and then it does make sense.

On the other hand, we should definitely not set mem_fences to True on configurations where vnode_per_numa_node is False.

Setting mem_fences to True on configurations where vnode_per_numa_node is enabled also requires a well-behaved workload (people with large NUMA hosts have, over the years, managed to make the workload behave well even when memory fences are enabled).

It is fairly important on large NUMA machines to avoid degenerate (non-local) memory allocations in the presence of rogue jobs (although it has become less so since the advent of the memory cgroup controller, which can keep rogues in check) but there are pitfalls to enforcing memory fences.

With memory fences you cannot allow jobs to straddle vnodes and share them with other jobs unless you know they will allocate memory in a way that matches what the scheduler thinks. It is really important to avoid exec_vnodes like blah[0]:ncpus=14:mem=1GB+blah[1]:ncpus=2:mem=15GB: the job might actually allocate 14GB on blah[0], and once blah[0]’s memory is depleted the OOM killer can wake up and kill jobs confined to blah[0] alone, simply because one job allocated more on blah[0] and less on blah[1] than the scheduler assumed.
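To make the mismatch concrete, here is a small sketch (the helper name is mine, and it assumes a simplified exec_vnode grammar with exactly one vnode per parenthesised chunk) that splits an exec_vnode string into what the scheduler thinks each vnode is supplying:

```python
def per_vnode_resources(exec_vnode):
    """Parse a simplified exec_vnode string such as
    (blah[0]:ncpus=14:mem=1gb)+(blah[1]:ncpus=2:mem=15gb)
    into {vnode_name: {resource: value}}."""
    alloc = {}
    # Chunks are parenthesised and joined by '+'.
    for chunk in exec_vnode.strip("()").split(")+("):
        parts = chunk.split(":")
        res = alloc.setdefault(parts[0], {})
        for kv in parts[1:]:
            key, _, val = kv.partition("=")
            res[key] = val
    return alloc

sched_view = per_vnode_resources(
    "(blah[0]:ncpus=14:mem=1gb)+(blah[1]:ncpus=2:mem=15gb)")
# The scheduler assumes 15GB lands on blah[1]; with memory fences a
# job can nonetheless touch most of that memory on blah[0] and
# starve jobs that are fenced to blah[0] alone.
```

The point of the sketch is only that the scheduler’s per-vnode accounting and the job’s real first-touch allocation are two different things, and memory fences make that gap dangerous.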

Usually, to make sure it all works, jobs sharing nodes are confined to specific vnodes, and the different job types (1 CPU, socket-wide, blade-wide etc.) are not mixed, so that if a vnode is shared, no job straddles that vnode and another one in a weird way.

As for the sharing attribute, that is usually not that important, but I’d always default it to “shared” unless the site knows that they have jobs that always use a multiple of the host size.

A well-managed site will usually have a queuejob hook adding the proper placement directives anyway (often with parallel jobs spanning more than one node always using -lplace=excl, regardless of their size, because the performance impact of sharing one node with a small job extends to the other parts of the job, which may sit spinning for a laggard thread).
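As an illustration of the kind of placement logic such a queuejob hook might apply (the decide_place helper is hypothetical; the pbs-module wiring is shown only in the comment, since the hook body depends on each site’s policy):

```python
# In a real PBS queuejob hook this helper would be called from the
# event handler, roughly:
#   import pbs
#   e = pbs.event()
#   e.job.Resource_List["place"] = pbs.place(decide_place(nodect))

def decide_place(nodect):
    """Force exclusive placement for multi-node jobs so that no small
    job can share a host with part of a parallel job; let
    single-node jobs share their host."""
    return "excl" if nodect and nodect > 1 else "shared"
```

This mirrors the policy described above: -lplace=excl for anything spanning more than one node, sharing otherwise.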

I do know sites that prefer another default, but it’s possible to set that in a v2 config file. One caveat, though: the script that generates the v2 config file really needs to be called from MoM’s init.d script, because otherwise, if a machine boots with some blades missing, you may be left with a v2 “sharing” config file that defines vnodes that no longer have any physical hardware associated with them.
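For reference, such a generated v2 vnode definition file might look roughly like this (hostA and the vnode layout are purely illustrative; only the sharing lines matter here):

```
$configversion 2
hostA: resources_available.ncpus=0
hostA: resources_available.mem=0
hostA[0]: sharing=default_shared
hostA[0]: resources_available.ncpus=18
hostA[0]: resources_available.mem=96gb
hostA[1]: sharing=default_shared
hostA[1]: resources_available.ncpus=18
hostA[1]: resources_available.mem=96gb
```

If a blade is missing at boot and this file is stale, the hostA[1] entries would describe hardware that is no longer there, which is exactly the caveat above.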

As a result, it would actually be a good idea to add support for setting the sharing attribute for vnodes in the cgroup hook’s exechost_startup routine, since that routine is very much aware of the hardware on the host. That would be an alternative to having to tinker with the init.d file.
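A sketch of what that could look like (the sharing_for_vnodes helper and the per-host vnode list are hypothetical; the pbs-module wiring is shown only in the comment, since exechost_startup hooks run inside MoM with the pbs module available):

```python
# Inside the cgroups hook's exechost_startup handler this could be
# applied roughly as:
#   import pbs
#   e = pbs.event()
#   for name, value in sharing_for_vnodes(my_numa_vnodes).items():
#       e.vnode_list[name].sharing = value

def sharing_for_vnodes(vnode_names, default="default_shared"):
    """Return the sharing value to apply to each NUMA vnode the hook
    creates; a site could override the default here."""
    return {name: default for name in vnode_names}
```

Because the hook enumerates the NUMA nodes itself, it cannot create sharing settings for hardware that is absent at boot, unlike a pre-generated v2 file.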

To refer to historic practice: when PBS Pro was installed on a UV-class machine, it also did not start the cpuset MoM by default. We advised people that enabling it would be a good idea, but we left it to them to change the behaviour explicitly. If we find such an approach wise, we should definitely leave vnode_per_numa_node set to false by default.

We should be telling people who have large NUMA machines and machines with GPUs that they’d best change the default, though.

Will do. I will add this to my design.

@alexis.cousein and I had a chat offline about some of his suggestions. We agreed that, since we don’t want new changes to get lost within the removal of the cpuset code, it would be best not to make those changes part of this RFE.

I updated the design to also set use_hyperthreads to true.
And to kill -HUP the MoM
Please have a look.

The updates for memory look good.