Pbs_server.bin / pbs_dataservice consuming memory

Hi all,

I have a situation I do not understand: for the past two weeks the pbs_server process on my primary server has been using increasing amounts of memory, until it was noticed by the alerting system. I have been trying to understand what is going on for the last three days.

The situation is as follows:

  • OpenPBS 22.05 running on Rocky Linux 8.8
  • primary (currently active) and secondary servers are virtual machines with 6 GB of RAM and 4 vCPUs
  • the spool is NFS-exported with default options (and no sync)
  • there are between 100 and 200 jobs in the queues on average (currently ~150)

systemctl says:

● pbs.service - Portable Batch System
   Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2024-02-07 15:55:25 CET; 3 months 2 days ago
     Docs: man:pbs(8)
    Tasks: 11
   Memory: 4.0G
   CGroup: /system.slice/pbs.service
           ├─979264 /opt/pbs/sbin/pbs_comm
           ├─979279 /opt/pbs/sbin/pbs_sched
           ├─979348 /opt/pbs/sbin/pbs_ds_monitor monitor
           ├─979397 /usr/bin/postgres -D /mnt/pbs_share/pbs_spool/datastore -p 15007
           ├─979408 postgres: logger process
           ├─979410 postgres: checkpointer process
           ├─979411 postgres: writer process
           ├─979412 postgres: wal writer process
           ├─979413 postgres: autovacuum launcher process
           ├─979414 postgres: stats collector process
           ├─979415 postgres: bgworker: logical replication launcher
           ├─979470 postgres: postgres pbs_datastore 192.168.10.55(51886) idle
           └─979472 /opt/pbs/sbin/pbs_server.bin

[someone@primary] # pbs_dataservice status
PBS data service running locally

If I look at specific PIDs, I see that the individual PostgreSQL processes account for a fair amount of the memory reported by free:

[someone@primary] # for pid in $(ps aux |grep postgres | awk '$1!~"root"{print $2}'); do pmap $pid | grep total; done | awk '{printf "%d\n",$2}'
458672
309388
458824
458672
458672
458672
313588
458672
473340
[someone@primary] # for pid in $(ps aux |grep postgres | awk '$1!~"root"{print $2}'); do pmap $pid | grep total; done | awk '{printf "%d\n",$2}' | awk '{a+=$1}END{print a}'
3848500
[someone@primary] # pmap 979472 | grep total
total          4401120K
[someone@primary] # free -h
              total        used        free      shared  buff/cache   available
Mem:          5.5Gi       4.5Gi       457Mi       407Mi       603Mi       428Mi
Swap:            0B          0B          0B

There is no big read/write activity shown by iotop. Looking through the PBS and system logs, I found no errors or messages hinting at anything.

When I ran into the problem I noticed that there were many jobs held due to errors, and jobs in "can never run" situations; hoping it was related, I cleaned them up, but no change was visible.
The system is running fine (and I did not try to force anything), but the situation is worrisome.
So I have the following questions:

  • what can I look for in the logs to get some hints?
  • what can I look for in postgres (see the sketch after this list)?
  • what if I restart pbs_dataservice? I may have some downtime with jobs suspended (some perhaps lost), but would it help?
  • what if, on the other hand, I do a manual switch to the secondary (and, by the way, is there a recommended way to do that among the possible alternatives)?
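
For context, this is the kind of check I had in mind on the postgres side: just a sketch, assuming psql can connect as the postgres user to the pbs_datastore database on port 15007 shown above (the data service may require the datastore password), to see which tables are largest:

[someone@primary] # psql -p 15007 -U postgres -d pbs_datastore -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"

That only shows on-disk table sizes, of course; the memory numbers above come from the backends themselves.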

Thank you in advance for your time.

> When I ran into the problem I noticed that there were many jobs held due to errors, and jobs in "can never run" situations; hoping it was related, I cleaned them up, but no change was visible.

The reason for held jobs can be found with tracejob, which tells you which node the job was scheduled to run on; go to that node and check its MoM logs, and they will show you the reason for the hold.
Usually it is related to user authentication on the compute node, the user's home directory missing or not mounted on the compute node, or the user not existing on the compute node.
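
For example, with a hypothetical job ID 1234 and a hypothetical node node01 using the default $PBS_HOME (adjust paths and the log date to your installation):

[someone@primary] # tracejob -n 3 1234        # search the last 3 days of server/scheduler logs for job 1234
[someone@node01] # grep 1234 /var/spool/pbs/mom_logs/20240509     # the MoM log for that day shows why the job was held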

A "can never run" situation arises when the user has made a resource request that cannot be satisfied by the resources of the cluster at that point in time.
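
You can see the scheduler's reasoning in the job's comment attribute, e.g. (job ID hypothetical):

[someone@primary] # qstat -f 1234 | grep -i comment    # the comment states why the job is held or can never run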

  • it is always recommended to host $PBS_HOME on an SSD or other high-speed disks if you are looking for performance and submitting thousands of jobs. Also make sure NFS locking is enabled on $PBS_HOME in a failover setup (sketched below); otherwise it can cause issues and render the datastore unusable, leading to a fresh installation.
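
As a sketch, a mount entry for the shared $PBS_HOME could look like the line below (server name and export path are placeholders); the key point is simply not to use the nolock option, since NFS client locking is enabled by default:

nfsserver:/export/pbs_spool   /mnt/pbs_share/pbs_spool   nfs   rw,hard   0 0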

With the information provided above, I do not see any issues in your setup. However, you can delete all the held jobs and any other jobs you think are not necessary, then switch over to the secondary server and check whether rebooting the primary VM helps (as it is a provisioned resource in a multi-tenant environment).
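
For example, to clean up the held jobs in one go (review the output of the first command before actually deleting anything):

[someone@primary] # qselect -s H               # list the job IDs currently in the Held state
[someone@primary] # qselect -s H | xargs qdel  # delete them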

Hi Adarsh,

thank you for your answer. Yes, I did check the reasons those jobs were stopped, and they were due to user errors, not to problems in the configuration; my question was whether many of those jobs could cause abnormal memory consumption, since the total number of jobs was not of a different magnitude compared to past weeks.

The number of jobs present in the queues at any given moment is not very high (say 200) and, until now, we have not experienced issues that could be related to disk performance.

Even in this case, disk performance and logging did not seem to be affected: the primary server was working fine, even if slowed down by all the occupied memory. Moreover, while the memory remained allocated after manually cleaning up the held jobs, it at least stayed stable; I do not know whether the cleanup was useful in the end.

Right now I have done the following (rough commands sketched after the list):

  • stopped scheduling
  • stopped the primary and triggered failover
  • left the secondary active for a while
  • started scheduling; memory in line with previous consumption
  • stopped scheduling
  • let the primary take over again
  • started scheduling; here too memory in line with previous consumption
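
For the record, roughly the commands behind those steps; this is only a sketch, since the exact takeover/handback behaviour depends on the failover configuration:

[someone@primary] # qmgr -c "set server scheduling = False"    # stop scheduling
[someone@primary] # systemctl stop pbs                         # stop the primary; the secondary takes over after the failover delay
  (secondary active for a while; scheduling toggled with the same qmgr command there)
[someone@primary] # systemctl start pbs                        # restart the primary, which takes control back from the secondary
[someone@primary] # qmgr -c "set server scheduling = True"     # resume scheduling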

So now everything seems to be fine, but I am still wondering what happened. Any ideas?
Thank you for your support.

Best,

G