PSS instead of RSS as mem


Would it be possible to report (and enforce) PSS (proportional set size) instead of RSS? Yes, that’s specific to Linux.


Could you please check the PBS + cgroups implementation and see whether it satisfies your requirement?

Check this guide:
Section/Chapter: Configuring and Using PBS with Cgroups memory Subsystem

Thanks for the suggestion. This actually splits the question into two:

  1. If I understand correctly, enabling cgroups support (which is currently off) requires shutting down the cluster, right? Except for the down time as such, should I be expecting any potential problems? I’m talking about a production cluster, so…

  2. Why is cgroups support essential? In principle, PBS could just scan /proc/<pid>/smaps and get the relevant information, couldn’t it?

  1. It does not need a system reboot or a shutdown of the compute node(s).
  2. libcgroup and libcgroup-tools need to be deployed and enabled; check whether the cgroup subsystem is mounted at /sys/fs/cgroup on the compute node(s).
  3. You can test it on specific node(s) only, by enabling it on those nodes and disregarding the rest of the compute nodes (it would not affect the rest of the cluster).
  4. Enabling and disabling is all via qmgr (it does not even need a restart of services).
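Step 2 above can be sanity-checked with a short script. This is only a sketch, not part of PBS (the helper name is ours); it covers both cgroup v1 mounts and the v2 unified hierarchy:

```python
# Sketch (not PBS code): check whether the cgroup memory controller is
# available, per step 2 above. Handles cgroup v1 (a mount of filesystem
# type "cgroup" carrying the "memory" option) and cgroup v2 (the unified
# hierarchy lists enabled controllers in one file).

def memory_controller_mounted():
    try:
        # cgroup v1: scan /proc/mounts for a "cgroup" mount with "memory"
        with open("/proc/mounts") as f:
            for line in f:
                fields = line.split()
                if fields[2] == "cgroup" and "memory" in fields[3].split(","):
                    return True
        # cgroup v2: check the enabled-controllers file
        with open("/sys/fs/cgroup/cgroup.controllers") as f:
            return "memory" in f.read().split()
    except (FileNotFoundError, IndexError):
        return False
```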

Cgroup section in the administration guide would have all the information and steps.

It does scan the top level of /proc//smaps, if I remember the conversation correctly, on every MoM polling cycle (120 seconds by default); this can be decreased/increased via mom_priv/config settings.
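For reference, reporting PSS without cgroups would mean summing the per-mapping fields of smaps on every poll, along these lines (a sketch; the function name is ours, the "Rss:"/"Pss:" field names are standard Linux /proc):

```python
# Sketch (not PBS code): sum the per-mapping "Rss:" and "Pss:" fields of
# a process's smaps file - what a poller would have to do on every cycle
# to report PSS. PSS never exceeds RSS, since shared pages are divided
# among the processes sharing them.

def smaps_totals(pid="self"):
    """Return (rss_kb, pss_kb) summed over all mappings of one process."""
    rss_kb = pss_kb = 0
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line.startswith("Rss:"):
                rss_kb += int(line.split()[1])   # values are in kB
            elif line.startswith("Pss:"):
                pss_kb += int(line.split()[1])
    return rss_kb, pss_kb
```

Newer kernels also provide /proc/<pid>/smaps_rollup, which pre-sums these fields, but the kernel still has to walk all the mappings to produce it.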

Why Cgroups: check the below section
section: 16.3 Why Use Cgroups?

enabling and disabling is all via qmgr (does not even need restart of services)

Hmm, but in Sec. I see “HUP or restart each MoM”…

Why Cgroups:

Yes, I know about the many benefits. To clarify my question: why is cgroups support needed specifically for calculating PSS?


I think HUP does not kill any jobs; it will re-read the mom_priv/config (updates).

By default PBS tracks the peak memory usage by polling and sums up the RSS values for all job processes it is aware of (in case PBS is integrated with cgroups you will be able to see the real peaks).
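What “sums up the RSS values” means in practice, roughly (a sketch, not the actual MoM code; note that shared pages are counted once per process, which is exactly the overcounting that PSS avoids):

```python
# Sketch (not MoM code): sum VmRSS over a set of PIDs, the way a poller
# tracks a job's memory as the sum of per-process RSS. Pages shared
# between the job's processes are counted once per process, so the sum
# can overstate the job's real footprint.

def job_rss_kb(pids):
    total_kb = 0
    for pid in pids:
        try:
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        total_kb += int(line.split()[1])  # value in kB
                        break
        except FileNotFoundError:
            pass  # process exited between listing and reading
    return total_kb
```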

  • cgroups give near-accurate memory reporting
  • PBS can do memory accounting for the job processes it knows about; in the case of MPI that is not tightly integrated with PBS, the memory reporting might not be accurate

I think HUP does not kill any jobs; it will re-read the mom_priv/config (updates).

Good to know. More important is whether this indeed needs to be done only on selected nodes. Say I start testing cgroups on just node1 and node2. Then, if I understand correctly, I can set "run_only_on_hosts": ["node1", "node2"] and restart the MoMs just on these two nodes, right?

BTW, earlier you pointed to the pbs_cgroups.* files in the master branch; is it to say that files from the latest official release (20.0.1) are not as good?

(in case PBS is integrated with cgroups you will be able to see the real peaks)

It seems I’m missing something. OK, with cgroups the stats are more accurate, but that doesn’t help to get PSS instead of RSS, does it? Either the “Pss:” fields in smaps are parsed (and summed, as the “Rss:” values are), or they are not… I’m confused.

Yes, that’s correct; you can test it on specific hosts only.

Master branch has the most recent updates.

Should the updated HK and PY versions be put in /var/spool/pbs/server_priv/hooks/ or /opt/pbs/lib/python/altair/pbs_hooks/?

You would need to import the hook and the configuration file as below:

  1. qmgr -c 'import hook pbs_cgroups application/x-python default'
  2. qmgr -c 'import hook pbs_cgroups application/x-config default'

Thank you,


Using PSS is slow on Linux – reading it uses kernel functionality that can hang the PBS MoM on large machines with a lot of processes and kernel threads, since it basically reconstructs the PSS from all mappings and how many processes share each mapping. SGI (now HPE) used to have similar functionality in their library, but had to write a caching daemon to make it work reliably; they have since abandoned support for that as well.

In contrast, the cgroup memory controller tags every page pulled into physical memory with a cgroup and updates counters. It requires no computation to arrive at the correct answer, nor a scan of the process list. It also allows swap accounting, and it accounts for the page cache used by the cgroup (a cgroup that hits its memory limit will see some of the page cache pages it pulled in recycled to satisfy further memory allocations for the application).
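To illustrate the difference: with the memory controller, reading the peak is a single file read rather than a process scan. A sketch (the path argument is an assumed example; the file names are the standard v1 and v2 counter files):

```python
# Sketch: read a cgroup's peak memory usage from the controller's
# counter file - one read, no per-process work. Tries the cgroup v2
# name first, then the v1 name. The cgroup path is caller-supplied
# (e.g. the directory the job's cgroup lives in).

def cgroup_peak_bytes(cgroup_path):
    for name in ("memory.peak",                 # cgroup v2
                 "memory.max_usage_in_bytes"):  # cgroup v1
        try:
            with open(f"{cgroup_path}/{name}") as f:
                return int(f.read())
        except FileNotFoundError:
            continue
    raise FileNotFoundError("no memory peak counter under " + cgroup_path)
```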


Thanks to everybody for the explanations!