PSS instead of RSS as mem

fnevgeny · August 5, 2021, 2:26pm

Hi,

Would it be possible to report (and enforce) PSS (proportional set size) instead of RSS? Yes, that’s specific to Linux.

Evgeny

adarsh · August 5, 2021, 7:56pm

Could you please check the PBS + cgroups implementation and whether it satisfies your requirements?

openpbs/openpbs/blob/master/src/hooks/cgroups/pbs_cgroups.PY

# coding: utf-8

# Copyright (C) 1994-2021 Altair Engineering, Inc.
# For more information, contact Altair at www.altair.com.
#
# This file is part of both the OpenPBS software ("OpenPBS")
# and the PBS Professional ("PBS Pro") software.
#
# Open Source License Information:
#
# OpenPBS is free software. You can redistribute it and/or modify it under
# the terms of the GNU Affero General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at your
# option) any later version.
#
# OpenPBS is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Affero General Public
# License for more details.
#

This file has been truncated.

openpbs/openpbs/blob/master/src/hooks/cgroups/pbs_cgroups.CF

{
    "cgroup_prefix"         : "pbs_jobs",
    "exclude_hosts"         : [],
    "exclude_vntypes"       : ["no_cgroups"],
    "run_only_on_hosts"     : [],
    "periodic_resc_update"  : true,
    "vnode_per_numa_node"   : false,
    "online_offlined_nodes" : true,
    "use_hyperthreads"      : false,
    "ncpus_are_cores"       : false,
    "discover_gpus"         : true,
    "manage_rlimit_as"      : true,
    "cgroup" : {
        "cpuacct" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : []
        },
        "cpuset" : {
            "enabled"            : true,

This file has been truncated.

Check this guide : https://www.altair.com/pdfs/pbsworks/PBSAdminGuide2021.1.pdf
Section/Chapter: Configuring and Using PBS with Cgroups
16.5.3.9 memory Subsystem

fnevgeny · August 5, 2021, 8:20pm

Thanks for the suggestion. This actually splits the question into two:

1. If I understand correctly, enabling cgroups support (which is currently off) requires shutting down the cluster, right? Apart from the downtime itself, should I expect any potential problems? I’m talking about a production cluster, so…
2. Why is cgroups support essential? In principle, PBS could just scan /proc/<pid>/smaps and get the relevant information, couldn’t it?
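
For readers unfamiliar with smaps, here is a minimal sketch of what "scanning /proc/<pid>/smaps" for PSS could look like. It is not how PBS does its accounting; the function names and the sample text are made up for illustration, and the only assumption about the kernel interface is the usual "Rss:" / "Pss:" line format in kB.

```python
import re

# Matches per-mapping lines such as "Pss:                 12 kB".
# Fields like "Pss_Anon:" or "SwapPss:" are deliberately not matched.
_FIELD_RE = re.compile(r"^(Rss|Pss):\s+(\d+)\s+kB$", re.MULTILINE)


def sum_rss_pss_kb(smaps_text):
    """Return (rss_kb, pss_kb) summed over every mapping in smaps_text."""
    totals = {"Rss": 0, "Pss": 0}
    for name, value in _FIELD_RE.findall(smaps_text):
        totals[name] += int(value)
    return totals["Rss"], totals["Pss"]


def read_rss_pss_kb(pid):
    """Read /proc/<pid>/smaps (Linux only; requires read permission)."""
    with open(f"/proc/{pid}/smaps") as f:
        return sum_rss_pss_kb(f.read())


# Hypothetical two-mapping excerpt: the first mapping is shared by three
# processes, so its PSS is one third of its RSS.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
Size:                328 kB
Rss:                 300 kB
Pss:                 100 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                  20 kB
Pss:                  20 kB
"""

rss_kb, pss_kb = sum_rss_pss_kb(SAMPLE)
print(f"RSS = {rss_kb} kB, PSS = {pss_kb} kB")
```

A job-level figure would require repeating this for every process of the job on every polling cycle.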

adarsh · August 6, 2021, 8:52am

It does not need a system reboot or a shutdown of the compute node(s).
libcgroup and libcgroup-tools need to be deployed and enabled; also check that the cgroup subsystem is mounted at /sys/fs/cgroup on the compute node(s).
You can test it on specific node(s) only, by enabling it on those nodes and leaving the rest of the compute nodes alone (it will not affect the rest of the cluster).
Enabling and disabling is all via qmgr (does not even need restart of services).

Cgroup section in the administration guide would have all the information and steps.

It does scan the top level of /proc/<pid>/smaps, if I remember the conversation correctly, on every MoM polling cycle (120 seconds by default); this interval can be decreased or increased via mom_priv/config settings.

Why Cgroups:
https://www.altair.com/pdfs/pbsworks/PBSAdminGuide2021.1.pdf check the below section
section: 16.3 Why Use Cgroups?

fnevgeny · August 8, 2021, 2:00pm

enabling and disabling is all via qmgr (does not even need restart of services)

Hmm, but in Sec. 16.5.4.2 I see “HUP or restart each MoM”…

Why Cgroups:

Yes, I know about the many benefits. To clarify my question: why is cgroups support needed specifically for calculating PSS?

Thanks.

adarsh · August 8, 2021, 7:21pm

I think HUP does not kill any jobs; it will re-read the mom_priv/config (updates).

By default PBS tracks the peak memory usage by polling and sums up the RSS values for all job processes it is aware of (in case PBS is integrated with cgroups you will be able to see the real peaks).
Added:

cgroups for near-accurate memory reporting
PBS can do the memory accounting for the job processes it knows about; in the case of MPI that is not tightly integrated with PBS, the memory reporting might not be accurate
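
To illustrate why a polled peak can differ from the real peak, here is a small sketch. The usage numbers are invented, the 120-second interval is the default polling cycle mentioned earlier in the thread, and the function name is hypothetical; it only models the sampling effect, not PBS itself.

```python
def observed_peak(samples_kb, poll_interval_s):
    """Peak seen by a poller that only looks at every poll_interval_s-th sample.

    samples_kb holds one summed-RSS reading per second of job runtime.
    """
    return max(samples_kb[::poll_interval_s])


# Hypothetical job: steady 1 GB usage with a 10-second spike to 8 GB
# between t = 200 s and t = 209 s.
usage_kb = [1_000_000] * 600
for t in range(200, 210):
    usage_kb[t] = 8_000_000

polled_peak = observed_peak(usage_kb, 120)  # polls at t = 0, 120, 240, ...
true_peak = max(usage_kb)

print(f"polled peak: {polled_peak} kB, true peak: {true_peak} kB")
```

In this made-up timeline every poll falls outside the spike, so the polled peak stays at the steady-state value, whereas a counter maintained by the kernel would record the spike.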

fnevgeny · August 8, 2021, 9:25pm

I think HUP does not kill any jobs; it will re-read the mom_priv/config (updates).

Good to know. More important is whether this indeed needs to be done only on selected nodes. Say I start testing cgroups on just node1 and node2. Then, if I understand correctly, I can set "run_only_on_hosts": ["node1", "node2"] and restart the MoMs on just these two nodes, right?
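
As a sketch of the edit being discussed, the snippet below sets "run_only_on_hosts" in a configuration fragment. It assumes the configuration file is plain JSON, as the excerpt above suggests; the fragment is shortened to a few keys from that excerpt, and restrict_to_hosts is a hypothetical helper, not part of PBS.

```python
import json


def restrict_to_hosts(config_text, hosts):
    """Return config_text with "run_only_on_hosts" set to the given hosts."""
    config = json.loads(config_text)
    config["run_only_on_hosts"] = list(hosts)
    return json.dumps(config, indent=4)


# Shortened fragment using keys from the excerpt quoted earlier in the thread.
FRAGMENT = """
{
    "cgroup_prefix"         : "pbs_jobs",
    "exclude_hosts"         : [],
    "exclude_vntypes"       : ["no_cgroups"],
    "run_only_on_hosts"     : []
}
"""

updated = restrict_to_hosts(FRAGMENT, ["node1", "node2"])
print(updated)
```

The edited text would still have to be imported as the hook's configuration file before it takes effect.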

BTW, earlier you pointed to the pbs_cgroups.* files in the master branch; is it to say that files from the latest official release (20.0.1) are not as good?

(in case PBS is integrated with cgroups you will be able to see the real peaks)

It seems I’m missing something. OK, with cgroups the stats are more accurate, but it doesn’t help to get PSS instead of RSS? Either the “Pss:” fields in smaps are parsed (and added to the “Rss” values), or they are not… I’m confused.

adarsh · August 9, 2021, 7:41am

Yes, that’s correct; you can test it on specific hosts only.

Master branch has the most recent updates.

fnevgeny · August 9, 2021, 1:34pm

Should the updated HK and PY versions be put in /var/spool/pbs/server_priv/hooks/ or /opt/pbs/lib/python/altair/pbs_hooks/?

adarsh · August 9, 2021, 3:08pm

You would need to import the hook and the configuration file as below:

qmgr -c 'import hook pbs_cgroups application/x-python default pbs_cgroups.py'
qmgr -c 'import hook pbs_cgroups application/x-config default pbs_cgroups.cf'

Thank you,

alexis.cousein · August 10, 2021, 2:07pm

Using PSS is slow on Linux: reading it uses kernel functionality that can hang PBS MoM on large machines with a lot of processes and kernel threads, since it basically reconstructs the PSS from all mappings and the number of processes sharing each mapping. SGI (now HPE) used to have similar functionality in their libmemacct.so library but had to write a caching daemon to make it work reliably; they have since abandoned support for that as well.

In contrast, the cgroup memory controller tags every page pulled into physical memory with a cgroup and updates counters. It requires neither computation to arrive at the correct answer nor scanning a list of processes. It also allows swap accounting and accounts for page cache used by the cgroup (a cgroup that hits the memory limit will see some of the page cache pages it pulled in recycled to satisfy further memory allocations for the application).
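
By way of contrast with the smaps approach, reading the cgroup counters is just a matter of reading a few small files. The sketch below assumes the cgroup v1 memory controller file names (memory.usage_in_bytes, memory.max_usage_in_bytes, memory.memsw.usage_in_bytes); the directory used here is a temporary stand-in, since the actual path of a job's cgroup under /sys/fs/cgroup depends on the site configuration, and the helper names are hypothetical.

```python
import os
import tempfile


def read_counter(cgroup_dir, name):
    """Return the integer stored in cgroup_dir/name, or None if it is absent."""
    try:
        with open(os.path.join(cgroup_dir, name)) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return None


def memory_stats(cgroup_dir):
    """Collect current usage, peak usage and memory+swap usage in bytes."""
    return {
        "usage": read_counter(cgroup_dir, "memory.usage_in_bytes"),
        "peak": read_counter(cgroup_dir, "memory.max_usage_in_bytes"),
        # Only present when swap accounting is enabled.
        "usage_with_swap": read_counter(cgroup_dir, "memory.memsw.usage_in_bytes"),
    }


# Stand-in directory with fake counter files, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as fake_cgroup:
    for name, value in [("memory.usage_in_bytes", 4096),
                        ("memory.max_usage_in_bytes", 8192)]:
        with open(os.path.join(fake_cgroup, name), "w") as f:
            f.write(f"{value}\n")
    stats = memory_stats(fake_cgroup)

print(stats)
```

No per-process scan is involved: the cost is independent of how many processes the job runs.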

fnevgeny · August 10, 2021, 2:14pm

Thanks to everybody for the explanations!
