Soft_limit in memory subsystem is ignored

Hi all,

I set "soft_limit": true in the "memory" cgroups section, reload (import) the file, and kill -HUP all MoMs, yet it has no effect: as soon as a job attempts to use more memory than requested, it is killed. Am I missing something obvious? "memsw" is also enabled and "swappiness" is set to 1.

OK, let me ask it differently: is anybody there for whom soft_limit works? In general, and specifically with openpbs-20.0.1?

Perhaps it would be good to look at what the cgroups settings for your job were, to determine whether the job or the kernel is playing tricks on you, or whether you simply do not have the resources (if you have no swap and ask for more memory than there is available for all jobs, then you’ll wake up the OOM killer regardless of what the cgroup limits are).

When you enable soft limits, the hard limit_in_bytes is supposed to be set to the lesser of vmem requested (i.e. swap plus physical memory) or the memory available for all jobs.

        # Adjust for soft limits if enabled
        if mem_enabled and softmem_enabled:
            softmem_limit = mem_limit
            # The hard memory limit is assigned the lesser of the vmem
            # limit and available memory
            if size_as_int(vmem_limit) < mem_avail:
                mem_limit = vmem_limit
            else:
                mem_limit = str(mem_avail)

BUT if you enabled memsw, then the hook may set the vmem limit to be the same as the mem requested or only slightly more (depending on the configuration of the memsw section, with semantics that are different depending on the version of the cgroup hook).

That is done in the section right above the one quoted here.

Asking for a larger “vmem” explicitly in your select statement may make it do what you want (rather than what you specified or failed to specify), and so would setting the default in the memsw section large enough (caveat: check whether the hook ADDS this default to the mem_limit to get the vmem_limit or whether the default becomes the total of mem plus swap allowed; that depends on the version of the cgroup hook – again, see the code snippet just above the quoted snippet.)
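
To make the arithmetic concrete, here is a minimal standalone sketch of the soft/hard limit computation described above. It is not the hook's code: compute_limits is a hypothetical helper, sizes are plain byte counts rather than pbs.size values, and the 64 GB of available memory is an assumed figure for illustration.

```python
def compute_limits(mem, vmem, mem_avail, soft_enabled):
    """Return (soft_limit, hard_limit) in bytes; soft_limit is None if unused.

    Hypothetical helper mirroring the quoted snippet, using plain ints.
    """
    if not soft_enabled:
        return None, mem
    soft_limit = mem
    # Hard limit: the lesser of the vmem limit and memory available to jobs.
    hard_limit = min(vmem, mem_avail)
    return soft_limit, hard_limit


MB = 1024 ** 2
GB = 1024 ** 3

# vmem ends up equal to mem: hard == soft, so the soft limit buys nothing.
same = compute_limits(100 * MB, 100 * MB, 64 * GB, True)
# vmem explicitly larger than mem: the job may grow past its soft limit.
larger = compute_limits(100 * MB, 2 * GB, 64 * GB, True)
print(same, larger)
```

If the vmem limit collapses to the mem request (for instance through the memsw defaults), the first case applies and the job is killed at its requested memory even though soft limits are on.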

Hi,

Thanks for your reply. I’ve just tried enabling the soft limit without the memsw subsystem at all, with the same result (job killed with “Cgroup mem limit exceeded”). There is certainly enough physical memory (testing with a modest allocation). Just to be sure: is qmgr -c "import hook... followed by kill -HUP pbs_mom on all nodes sufficient to enable the new config?

Hmm, so apparently with no swap, limit_in_bytes and soft_limit_in_bytes are always the same. How then it’s supposed to work?

I’d expect the soft_limit to allow a process to exceed its requested memory as long as there is free RAM in the system, and if not, to move the extra to the swap, and even when that is unavailable, to kill the process.

“Hmm, so apparently with no swap, limit_in_bytes and soft_limit_in_bytes are always the same. How then it’s supposed to work?”

By requesting a larger “vmem” than “mem” for your job. Or by enabling memsw and specifying a default for vmem. Or by changing the hook to do what you expect.

Well, I think it would be reasonable to set the hard mem limit to infinity if vmem is not specified at all. Doesn’t that make sense? Currently, when vmem is not set, it’s treated as zero and so hard limit = soft limit.

Well, whatever I try, it doesn’t work. Now I have memsw enabled, so according to your explanation, the following job should have the hard limit of 2 (or 2.1) GB, right?

#!/bin/bash
#PBS -q N
#PBS -N test-mem
#PBS -V 
#PBS -m n
#PBS -l select=1:ncpus=1:mem=100m:vmem=2g

cd $PBS_O_WORKDIR

memctl=`grep memory /proc/$$/cgroup |cut -d: -f3`

cgget -g memory:$memctl

Yet the output still shows only 100MB (grepped for “limit_in_bytes”):

memory.kmem.tcp.limit_in_bytes: 9223372036854771712
memory.limit_in_bytes: 104857600
memory.kmem.limit_in_bytes: 9223372036854771712
memory.soft_limit_in_bytes: 104857600
memory.memsw.limit_in_bytes: 104857600
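
A quick way to check output like this is to compare the soft and hard limits programmatically. The sketch below parses lines in the "name: value" form shown above; parse_limits and soft_limit_effective are hypothetical helpers written for this illustration, not part of PBS or libcgroup.

```python
def parse_limits(text):
    """Parse 'memory.x.limit_in_bytes: N' lines into a dict of ints."""
    limits = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            limits[name.strip()] = int(value)
    return limits


def soft_limit_effective(limits):
    """A soft limit only matters if the hard limit sits above it."""
    return (limits["memory.limit_in_bytes"]
            > limits["memory.soft_limit_in_bytes"])


sample = """\
memory.limit_in_bytes: 104857600
memory.soft_limit_in_bytes: 104857600
memory.memsw.limit_in_bytes: 104857600
"""
limits = parse_limits(sample)
print(soft_limit_effective(limits))  # False: hard limit equals soft limit
```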

Yes. I’m not familiar with your cgroup hook, however. The fact that you ALSO have a hard limit is bizarre if you have enabled soft limits; that too seems to be a bug in the hook:

            else:
                # For all the rest just pass hostresc[resc] down to set_limit
                self.set_limit(resc, hostresc[resc], jobid)

Obviously that call should not be made for hostresc['mem'] if self.cfg['cgroup']['memory']['soft_limit'] is set.

I pointed you at the code – feel free to copy the hook and add log messages to figure out what is happening.

BTW, at DEBUG4 level there is also this to be read in the logs:
    pbs.logmsg(pbs.EVENT_DEBUG4,
               "Limits computed from requests/defaults: "
               "mem: %s vmem: %s" % (mem_limit, vmem_limit))

And of course if this is wrong feel free to change the hook.

In the hook I’m currently using, even though I still have the bug that the hard limit is being set even when requesting soft limits, the memsw limit is set correctly:

[alexis@dragon cgroups]$ qsub -I -lselect=1:mem=100m:vmem=2gb
qsub: waiting for job 9194.dragon to start
qsub: job 9194.dragon ready
[alexis@dragon ~]$ cd /sys/fs/cgroup/memory/pbs_jobs.service//jobid/9194.dragon/
[alexis@dragon 9194.dragon]$ head memory.*limit*
==> memory.kmem.limit_in_bytes <==
9223372036854771712

==> memory.kmem.tcp.limit_in_bytes <==
9223372036854771712

==> memory.limit_in_bytes <==
104857600

==> memory.memsw.limit_in_bytes <==
2147483648

==> memory.soft_limit_in_bytes <==
104857600

I fixed the bug by inserting this snippet in front of the ‘catchall, just pass this on’ limit setting to avoid a hard limit being set too low when enabling the soft limit:

            elif (resc == 'mem'
                  and self.cfg['cgroup']['memory']['soft_limit']):
                # Don't set hard mem limit to mem requested if soft mem 
                # limits were enabled;
                # you do need to set a hard limit to the vmem limit
                # if applicable or setting the vmem limit will fail
                if 'vmem' in hostresc:
                    self.set_limit(resc, hostresc['vmem'], jobid)
            else:
                # For all the rest just pass hostresc[resc] down to set_limit
                self.set_limit(resc, hostresc[resc], jobid)
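
The decision made by that elif/else pair can be modelled as a small pure function, which makes it easy to see what ends up as the hard limit in each case. This is only a sketch: hard_limit_to_set is a hypothetical stand-in that models the choice of value, not the actual cgroup write done by set_limit.

```python
def hard_limit_to_set(resc, hostresc, soft_limit):
    """Return the value the patched branch would pass to set_limit, or None.

    Hypothetical stand-in for the elif/else pair above; it only models
    the decision, not the write to the cgroup filesystem.
    """
    if resc == "mem" and soft_limit:
        # With soft limits on, the hard mem limit follows vmem, if present.
        return hostresc.get("vmem")
    return hostresc[resc]


req = {"mem": 104857600, "vmem": 2147483648}
print(hard_limit_to_set("mem", req, soft_limit=True))    # 2147483648
print(hard_limit_to_set("mem", req, soft_limit=False))   # 104857600
print(hard_limit_to_set("mem", {"mem": 104857600}, soft_limit=True))  # None
```

Note the last case: if vmem never makes it into hostresc, this branch sets no hard limit at all, so the outcome depends on what was written earlier.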

As you can see, though, I cannot reproduce the memsw limit not being set correctly when you specify vmem.

Thanks for your help. The fix didn’t help, as somehow “vmem” is outright stripped from the hostresc dictionary. Furthermore, at this point, “mem” is already set anyway, which happens several lines above. In the log, I see, e.g.,

pbs_python;Hook;pbs_python;Assigned resources: {'cpuset.cpus': [28], 'cpuset.mems': [1, 1], 'mem': 104857600}

So somehow, the requested mem value is passed to self.available_node_resources(node). Totally weird.

BTW, how do I see DEBUG4 level logs? The admin guide suggests using tracejob -f debug4 ..., but it doesn’t work.

On my version of the hook (albeit a PBSPro professional one) hostresc DOES contain vmem. hostresc is “the resources requested on this host from parsing the select statement of the job after taking into account the hook config”, so it’s not STRIPPED, but simply not being ADDED. Why is something you might have to discover by adding pbs.logmsg messages in the hook.

It may be an older hook, of course. The hook on GitHub (pbs_cgroups.PY on master in openpbs/openpbs) does seem to add vmem to hostresc (with the caveat that there are a lot of “if” hurdles to pass to reach the “hostresc['vmem'] = pbs.size(vmem_limit)” line).

You set $logevent in the MoM config file to e.g. 0xffff to get DEBUG4 messages.

Thanks again. In fact, I use the version of the hook from github (with the small addition/fix suggested by you above), so it is indeed a puzzle why we get different results. I suspect something related to settings in pbs_cgroups.CF.

The number of if/else’s is indeed huge in the script.

Finally, I think I found the problem. In this code,

                if (vmem_default is None
                        or not self.cfg['cgroup']['memsw']
                                       ['enforce_default']
                        or (self.cfg['cgroup']['memsw']
                                    ['exclhost_ignore_default']
                            and 'place' in job.Resource_List
                            and 'exclhost'
                                in repr(job.Resource_List['place']))):
                    vmem_limit = str(vmem_avail)

I added

                        or softmem_enabled

to the list of conditions, and it solved the issue for me (but your fix is crucial as well). That is, it is no longer necessary to specify vmem at all.

Quite; the supported way to do that in that hook’s config file if memsw is enabled is to set enforce_default to False.
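
Both routes end up in the same branch of that condition. As a rough illustration, the test can be written as a standalone predicate; use_all_available_vmem is a hypothetical name, and the arguments are plain values standing in for the hook's config entries and the job's place request.

```python
def use_all_available_vmem(vmem_default, enforce_default,
                           exclhost_ignore_default, place,
                           softmem_enabled=False):
    """True when vmem_limit would be set to all available vmem.

    Hypothetical restatement of the quoted condition, including the
    extra 'or softmem_enabled' clause from the local patch.
    """
    return bool(vmem_default is None
                or not enforce_default
                or (exclhost_ignore_default and "exclhost" in place)
                or softmem_enabled)


# Default enforced, no exclhost, no patch: the default vmem applies.
print(use_all_available_vmem("256mb", True, False, "free"))          # False
# Supported route: enforce_default set to False in the config.
print(use_all_available_vmem("256mb", False, False, "free"))         # True
# Patched route: soft limits enabled short-circuits the default.
print(use_all_available_vmem("256mb", True, False, "free", True))    # True
```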

Another non-obvious (at least to me) point is that $enforce mem should not be set. Otherwise, jobs exceeding the requested mem value are still killed.

That you should not be using that has been true for at least a decade on Linux: $enforce mem actually sums the RSS values for all processes of a job, and with e.g. OpenMP applications it has always done the wrong thing (except on SGI systems in the past, when a special libmemacct.so was detected, but even HPE has been dropping support for that and has told its customers to use cgroups instead).

The commercial PBSPro Admin guides do go to some lengths to document those pitfalls, but IIRC they’re only part of the commercial product.

Well, the lack of [clarity in the] documentation is one thing, but how could one enforce memory usage without $enforce mem in the absence of the cgroups hook?

You really could not, at least in general (for single process applications you could use pvmem).

In the kernel they introduced PSS (Proportional Set Size), but the interface would frequently hang on a global process table lock so no one ever took that up. Which is why SGI (later HPE) developed libmemacct.so, but it also had the same issues with the process table lock, so they developed a caching daemon for it and tuning that was really black magic (but worked).
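
To illustrate why summing RSS overcounts and what PSS was meant to fix, here is a small arithmetic sketch. It assumes an idealised job in which every process has touched all pages of one shared region; summed_rss and summed_pss are hypothetical helpers, and the numbers are invented for the example.

```python
def summed_rss(private_per_proc, shared, nprocs):
    """Each process's RSS counts every shared page it has mapped in full."""
    return nprocs * (private_per_proc + shared)


def summed_pss(private_per_proc, shared, nprocs):
    """PSS charges each process only its 1/nprocs share of shared pages."""
    return nprocs * private_per_proc + shared


# 4 processes, 100 MB private each, one 1024 MB shared region (values in MB).
print(summed_rss(100, 1024, 4))  # 4496
print(summed_pss(100, 1024, 4))  # 1424, the memory actually in use
```

A job like this that requested 2 GB would be well within its request, yet a monitor summing RSS would see more than 4 GB.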

The more important issue is that enforcing this via monitoring is tricky: rogue programs can overwhelm the system, or even before that degrade the memory locality of well-behaved processes, long before MoM could detect the rogues.

It’s not an easy problem to solve…

Hence the “memory” cgroup controller…