Hi,
Would it be possible to report (and enforce) PSS (proportional set size) instead of RSS? Yes, that’s specific to Linux.
Evgeny
Could you please check the PBS + cgroups implementation and see whether it satisfies your requirement?
Check this guide: https://www.altair.com/pdfs/pbsworks/PBSAdminGuide2021.1.pdf
Section/Chapter: Configuring and Using PBS with Cgroups
16.5.3.9 memory Subsystem
Thanks for the suggestion. This actually splits the question into two:
If I understand correctly, enabling cgroups support (which is currently off) requires shutting down the cluster, right? Aside from the downtime itself, should I be expecting any potential problems? I’m talking about a production cluster, so…
Why is cgroups support essential? In principle, PBS could just scan /proc/<pid>/smaps
and get the relevant information, couldn’t it?
The cgroups section in the administration guide has all the information and steps.
It does scan the top level of /proc/<pid>/smaps, if I remember the conversation correctly, on every MoM polling cycle (120 seconds by default); this interval can be decreased/increased via mom_priv/config settings.
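For illustration, the per-process scan being discussed (summing the "Pss:" fields from smaps, which is what the original question asks about) could be sketched roughly as follows; this is only a sketch, assuming a Linux /proc filesystem, not PBS code:

```python
import os

def pss_kib(pid):
    """Sum the Pss: fields from /proc/<pid>/smaps; values are reported in kB.

    The kernel recomputes PSS per mapping on every read of this file, which
    is why it is expensive for processes with many mappings.
    """
    total = 0
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line.startswith("Pss:"):
                total += int(line.split()[1])
    return total

print(pss_kib(os.getpid()))  # PSS of this process, in KiB
```

A MoM-style collector would have to run this over every process of every job on each polling cycle, which is where the cost discussed later in the thread comes from.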
Why Cgroups:
https://www.altair.com/pdfs/pbsworks/PBSAdminGuide2021.1.pdf
Section 16.3: Why Use Cgroups?
Enabling and disabling is all done via qmgr (it does not even need a restart of services).
Hmm, but in Sec. 16.5.4.2 I see “HUP or restart each MoM”…
Why Cgroups:
Yes, I know about the many benefits. To clarify my question: why is cgroups
support needed specifically for calculating PSS?
Thanks.
I think HUP does not kill any jobs; it will re-read the mom_priv/config (updates).
By default, PBS tracks the peak memory usage by polling: it sums the RSS values of all job processes it is aware of. (When PBS is integrated with cgroups, you will be able to see the real peaks.)
Added:
I think HUP does not kill any jobs; it will re-read the mom_priv/config (updates).
Good to know. More importantly, does this indeed need to be done only on selected nodes? Say I start testing cgroups on just node1 and node2. If I understand correctly, I can set "run_only_on_hosts": ["node1", "node2"]
and restart the MoMs on just those two nodes, right?
BTW, earlier you pointed to the pbs_cgroups.* files in the master branch; does that mean the files from the latest official release (20.0.1) are not as good?
(When PBS is integrated with cgroups, you will be able to see the real peaks.)
It seems I’m missing something. OK, with cgroups the stats are more accurate, but that doesn’t help to get PSS instead of RSS, does it? Either the “Pss:” fields in smaps
are parsed (and summed, like the “Rss:” values), or they are not… I’m confused.
Yes, that’s correct; you can test it on specific hosts only.
Master branch has the most recent updates.
Should the updated HK and PY versions be put in /var/spool/pbs/server_priv/hooks/
or in /opt/pbs/lib/python/altair/pbs_hooks/?
You would need to import the hook and the configuration file as below:
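The commands the reply refers to were not preserved in the thread; a plausible sketch, following the qmgr hook-import syntax from the PBS administrator guide cited above (exact file names are assumptions), would be:

```shell
# Import the hook script and its configuration file into the server
# (run as root on the PBS server host; file names are illustrative).
qmgr -c "import hook pbs_cgroups application/x-python default pbs_cgroups.PY"
qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.CF"
# Enable the hook once the configuration is in place.
qmgr -c "set hook pbs_cgroups enabled = true"
```

Imported hooks are stored by the server itself, which is why copying the files manually into server_priv/hooks/ is not the intended workflow.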
Thank you,
Using PSS is slow on Linux: reading it uses kernel functionality that can hang the PBS MoM on large machines with many processes and kernel threads, since the kernel essentially reconstructs PSS by walking all mappings and counting how many processes share each one. SGI (now HPE) used to have similar functionality in their libmemacct.so library but had to write a caching daemon to make it work reliably; they have since abandoned support for that as well.
In contrast, the cgroup memory controller tags every page pulled into physical memory with a cgroup and updates counters as it goes. Arriving at the correct answer requires no computation and no scan of a process list. It also supports swap accounting and accounts for page cache used by the cgroup (a cgroup that hits its memory limit will see some of the page-cache pages it pulled in recycled to satisfy further memory allocations for the application).
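Reading those counters is then just a file read. A minimal sketch, assuming the cgroup v1 memory controller and an illustrative job path (the real layout depends on the pbs_cgroups hook configuration; cgroup v2 exposes memory.current instead):

```python
def read_cgroup_counter(path):
    """Return an integer counter from a cgroup control file.

    The kernel maintains these counters as pages are charged to the
    cgroup, so reading one is O(1) regardless of process count.
    """
    with open(path) as f:
        return int(f.read().strip())

# Hypothetical job cgroup path, for illustration only:
# peak = read_cgroup_counter(
#     "/sys/fs/cgroup/memory/pbsjobs/123.server/memory.max_usage_in_bytes")
```

This constant-cost read is the contrast being drawn with the per-mapping smaps walk that PSS requires.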
Thanks to everybody for the explanations!