The memory used by a multi-node job is not displayed

After a job is executed on multiple nodes, the memory used by the job is not displayed when viewing it with qstat -xf.

qstat -xf 518:
Job Id: 518.e004-2
Job_Name = 1066
Job_Owner = job_user@e004-2
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.ncpus = 36
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = F
queue = q200
server = e004-2
exec_host = e005/0*18+e006/0*18
exec_vnode = (e005:ncpus=18)+(e006:ncpus=18)
……

You would need tight integration if you are using MPI flavours (Intel MPI, Open MPI).
Otherwise, you would have to use cgroups for proper accounting of resources.
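The reason cgroups give proper accounting is that the hook places all of a job's local processes into one cgroup per host, so the kernel aggregates their usage for you. A minimal sketch of reading that aggregate on an execution host; the cgroup path is an assumption (it depends on the hook's cgroup_prefix, the hook version, and whether the host uses cgroup v1):

```shell
# Sketch, assuming the default cgroup_prefix "pbs_jobs" and a cgroup-v1
# memory controller mounted at /sys/fs/cgroup/memory; adjust to your layout.
job_peak_mem() {
    cg="/sys/fs/cgroup/memory/pbs_jobs.service/jobid/$1"
    if [ -r "$cg/memory.max_usage_in_bytes" ]; then
        # Peak memory of every process the job ran on this host, combined.
        echo "peak memory for $1: $(cat "$cg/memory.max_usage_in_bytes") bytes"
    else
        echo "no memory cgroup for $1 on this host"
    fi
}
job_peak_mem "518.e004-2"
```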

The problem is that I have already been using cgroups.

[root@e004-2 ~]# qmgr -c 'l h'
Hook pbs_cgroups
type = site
enabled = true
event = execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup
user = pbsadmin
alarm = 90
freq = 120
order = 100
debug = false
fail_action = offline_vnodes

pbs_cgroups:
{
    "cgroup_prefix": "pbs_jobs",
    "exclude_hosts": [],
    "exclude_vntypes": ["no_cgroups"],
    "run_only_on_hosts": [],
    "periodic_resc_update": true,
    "vnode_per_numa_node": false,
    "online_offlined_nodes": true,
    "use_hyperthreads": false,
    "ncpus_are_cores": false,
    "discover_gpus": true,
    "manage_rlimit_as": true,
    "cgroup": {
        "cpuacct": {
            "enabled": true,
            "exclude_hosts": [],
            "exclude_vntypes": []
        },
        "cpuset": {
            "enabled": true,
            "exclude_cpus": [],
            "exclude_hosts": [],
            "exclude_vntypes": [],
            "allow_zero_cpus": true,
            "mem_fences": false,
            "mem_hardwall": false,
            "memory_spread_page": false
        },
        "devices": {
            "enabled": true,
            "exclude_hosts": [],
            "exclude_vntypes": [],
            "allow": [
                "b *:* rwm",
                "b 7:* rwm",
                "c *:* rwm",
                "c 195:* m",
                "c 136:* rwm",
                ["infiniband/rdma_cm", "rwm"],
                ["fuse", "rwm"],
                ["net/tun", "rwm"],
                ["tty", "rwm"],
                ["ptmx", "rwm"],
                ["console", "rwm"],
                ["null", "rwm"],
                ["zero", "rwm"],
                ["full", "rwm"],
                ["random", "rwm"],
                ["urandom", "rwm"],
                ["cpu/0/cpuid", "rwm", "*"],
                ["nvidia-modeset", "rwm"],
                ["nvidia-uvm", "rwm"],
                ["nvidia-uvm-tools", "rwm"],
                ["nvidiactl", "rwm"]
            ]
        },
        "memory": {
            "enabled": false,
            "exclude_hosts": [],
            "exclude_vntypes": [],
            "soft_limit": false,
            "enforce_default": false,
            "exclhost_ignore_default": false,
            "default": "256MB",
            "reserve_percent": 0,
            "reserve_amount": "0GB"
        },
        "memsw": {
            "enabled": false,
            "exclude_hosts": [],
            "exclude_vntypes": [],
            "enforce_default": true,
            "exclhost_ignore_default": false,
            "default": "0B",
            "reserve_percent": 0,
            "reserve_amount": "64MB",
            "manage_cgswap": false
        },
        "hugetlb": {
            "enabled": false,
            "exclude_hosts": [],
            "exclude_vntypes": [],
            "enforce_default": true,
            "exclhost_ignore_default": false,
            "default": "0B",
            "reserve_percent": 0,
            "reserve_amount": "0B"
        }
    }
}

Please check whether the memory subsystem is enabled in your cgroups configuration.
Also, please share the characteristics of your job and the script used to run it.
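A quick way to check what the hook actually sees is to export its configuration on the server and run it through a JSON parser, which also catches syntax damage such as smart quotes or missing brackets. The qmgr export line below is shown as a comment because it needs a live PBS server; the rest is a self-contained sketch using a stand-in file:

```shell
# On the PBS server you could export the live hook configuration with:
#   qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json
# Here we validate a minimal stand-in file the same way:
cat > /tmp/pbs_cgroups.json <<'EOF'
{ "cgroup": { "memory": { "enabled": true } } }
EOF
python3 - <<'EOF'
import json
cfg = json.load(open("/tmp/pbs_cgroups.json"))  # raises on any JSON syntax error
print("memory enabled:", cfg["cgroup"]["memory"]["enabled"])
EOF
```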

The memory subsystem has "enabled": false:

"memory": {
    "enabled": false,
    "exclude_hosts": [],
    "exclude_vntypes": [],
    "soft_limit": false,
    "enforce_default": false,
    "exclhost_ignore_default": false,
    "default": "256MB",
    "reserve_percent": 0,
    "reserve_amount": "0GB"
}

If I set "enabled": true, then when I submit a job without explicitly requesting memory, I will receive the default memory limit. However, I want the job to be able to use the full memory of the node. How should I handle this?

*.pbs:
#!/bin/bash
#PBS -N 1069
#PBS -k oed
#PBS -l select=1:ncpus=18:mpiprocs=18+1:ncpus=18:mpiprocs=18
#PBS -q q200
#PBS -o ./front_m.out
#PBS -e ./front_m.err
cd $PBS_O_WORKDIR
/opt/software/abaqus/Commands/abq6144 job=front_m cpus=36 int

Please update this line

#PBS -l select=1:ncpus=18:mpiprocs=18+1:ncpus=18:mpiprocs=18

to

#PBS -l select=1:ncpus=18:mem=100gb:mpiprocs=18+1:ncpus=18:mem=100gb:mpiprocs=18

or with the below two lines

#PBS -l select=2:ncpus=18:mem=100gb:mpiprocs=18
#PBS -l place=scatter
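When sizing the per-chunk mem value, note that the hook's reserve_percent and reserve_amount settings hold memory back for the OS, so a chunk can only be scheduled if it requests somewhat less than the node's physical RAM. A hypothetical sketch for nodes with around 192 GB of RAM (the figures are assumptions, not taken from this thread):

```shell
#PBS -l select=2:ncpus=18:mem=180gb:mpiprocs=18
#PBS -l place=scatter
```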

However, I don't have a node with more than 100GB of memory. If I submit that job, no resources will be available and it will stay queued forever. Also, do I need to set the memory subsystem's "enabled" to true?

Now I have found a new problem. With cgroups enabled, I request several CPUs for a job, but it uses only one CPU on the execution node. Is there something wrong with my settings?

Sorry, that was just an example of a memory request. Please choose a memory value suitable for your compute node configuration. Yes, please set the memory subsystem's "enabled" to true.

The cgroups hook should give exactly the resources requested in your qsub statement. I do not see any issue with your configuration.

Unfortunately, the fact is that my program starts many processes, but they are all confined to a single CPU and compete for it. Can I give each process its own CPU?

Please check whether this is application-related; it might have nothing to do with the scheduler. Try running the application from the command line with a predefined host list and check whether the CPU cores are used evenly, or whether context switching occurs and the processes always land on the same cores.
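One quick way to distinguish the two cases is to look at the CPU affinity mask the kernel has applied to a process. This is a generic Linux check, not PBS-specific: inside a job confined by the cpuset controller to one CPU, the list below would show a single core instead of a range.

```shell
# Every Linux process reports the CPUs it may run on; run this inside the
# job (e.g. from the job script) and compare against the ncpus you requested.
grep Cpus_allowed_list /proc/self/status
```

If the list matches the requested CPU count but the application still piles onto one core, the pinning is being done by the application (or its launcher), not by the cgroup.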

I found that this was a problem with the software version; after changing the version, the job ran without this problem. In addition, even with the memory subsystem turned on, after a multi-node job completes I still cannot see the CPU and memory usage statistics with qstat -xf job_id.

IIRC, remote resource usage is collected only every two minutes or so. Try running a longer job and see if you get reasonable memory usage numbers.
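That two-minute figure matches the freq = 120 shown in the hook listing above: the exechost_periodic event is what pushes each host's resources_used back to the server. If that interval is too coarse, it can be lowered; the qmgr line below follows the hook attribute syntax shown earlier in this thread, and needs to be run on the server as a PBS manager:

```shell
qmgr -c "set hook pbs_cgroups freq = 30"
```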

Well, the job finished after running for more than 15 hours, and the statistics were still empty. The problem does not occur when I run the job on a single node.