Hi all,
I have a situation I do not understand: for the past two weeks the pbs_server
process on my primary server has been using increasing amounts of memory, until it was noticed by the alerting system. I have been trying to understand what is going on for the last 3 days.
The situation is as follows:
- OpenPBS 22.05 running on Rocky Linux 8.8
- primary (currently active) and secondary servers are virtual machines with 6 GB of RAM and 4 vCPUs
- the spool is NFS exported with defaults (and no sync)
- there are between 100 and 200 jobs on average in the queues (now ~150)
systemctl says:
● pbs.service - Portable Batch System
Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
Active: active (running) since Wed 2024-02-07 15:55:25 CET; 3 months 2 days ago
Docs: man:pbs(8)
Tasks: 11
Memory: 4.0G
CGroup: /system.slice/pbs.service
├─979264 /opt/pbs/sbin/pbs_comm
├─979279 /opt/pbs/sbin/pbs_sched
├─979348 /opt/pbs/sbin/pbs_ds_monitor monitor
├─979397 /usr/bin/postgres -D /mnt/pbs_share/pbs_spool/datastore -p 15007
├─979408 postgres: logger process
├─979410 postgres: checkpointer process
├─979411 postgres: writer process
├─979412 postgres: wal writer process
├─979413 postgres: autovacuum launcher process
├─979414 postgres: stats collector process
├─979415 postgres: bgworker: logical replication launcher
├─979470 postgres: postgres pbs_datastore 192.168.10.55(51886) idle
└─979472 /opt/pbs/sbin/pbs_server.bin
[someone@primary] # pbs_dataservice status
PBS data service running locally
If I look at the individual PIDs I see that the PostgreSQL processes are each using a fair amount of the memory that free reports:
[someone@primary] # for pid in $(ps aux |grep postgres | awk '$1!~"root"{print $2}'); do pmap $pid | grep total; done | awk '{printf "%d\n",$2}'
458672
309388
458824
458672
458672
458672
313588
458672
473340
[someone@primary] # for pid in $(ps aux |grep postgres | awk '$1!~"root"{print $2}'); do pmap $pid | grep total; done | awk '{printf "%d\n",$2}' | awk '{a+=$1}END{print a}'
3848500
[someone@primary] # pmap 979472 | grep total
total 4401120K
[someone@primary] # free -h
total used free shared buff/cache available
Mem: 5.5Gi 4.5Gi 457Mi 407Mi 603Mi 428Mi
Swap: 0B 0B 0B
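(One thing I am not sure of: I believe pmap's "total" is the virtual size and counts PostgreSQL's shared buffers once per backend, so the sums above probably overstate real usage. As a rough sketch, and assuming /proc/<pid>/smaps_rollup is available on this kernel, I was going to compare the proportional resident memory (Pss) instead:

for pid in $(ps aux | grep postgres | awk '$1!~"root"{print $2}'); do
    # Pss splits shared pages among the processes that map them
    awk -v pid="$pid" '/^Pss:/ {printf "%s %d kB\n", pid, $2}' "/proc/$pid/smaps_rollup"
done
# same thing for pbs_server.bin
awk '/^Pss:/ {print "pbs_server.bin", $2, "kB"}' /proc/979472/smaps_rollup

A low Pss for the postgres backends combined with a high one for pbs_server.bin would point at the server process itself rather than the database.)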
There is no significant read/write activity shown by iotop. Looking in the PBS and system logs I found no errors or messages hinting at anything.
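(In case I am looking in the wrong place: by "looking in the logs" I mean roughly the following, with PBS_HOME assumed to be /mnt/pbs_share/pbs_spool based on the datastore path above and the daily server log files named YYYYMMDD:

grep -iE 'error|fail|mem' /mnt/pbs_share/pbs_spool/server_logs/2024*
journalctl -u pbs.service --since "2024-02-07" | grep -iE 'error|oom|memory'

so if there is a better place or pattern to search for memory-related problems, suggestions are welcome.)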
When I ran into the problem I noticed that there were many jobs held due to errors or stuck in "can never run" situations; I hoped it was related, but no changes were visible.
The system is running fine (and I have not tried to force anything), but the situation is worrisome.
So I have the following questions:
- what can I look for in the logs to get some hints?
- what can I look for in PostgreSQL? (a rough sketch of what I had in mind is after this list)
- what if I restart pbs_dataservice? I may have some downtime with jobs suspended (and perhaps some lost), but would it help?
- what if, on the other hand, I do a manual switch to the secondary (and, by the way, is there a preferred way to do that among the possible alternatives)?
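For the PostgreSQL question, this is the kind of starting point I had in mind, only a sketch I have not run yet; the port (15007) and database/user (pbs_datastore / postgres) are taken from the process list above, while the psql path, host and authentication are assumptions on my part:

/usr/bin/psql -p 15007 -U postgres -d pbs_datastore <<'EOF'
-- backends and their state, to spot idle-in-transaction sessions
SELECT pid, state, query_start, left(query, 60) AS query
  FROM pg_stat_activity;
-- biggest relations in the PBS datastore
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS size
  FROM pg_class
 WHERE relkind = 'r'
 ORDER BY pg_total_relation_size(oid) DESC
 LIMIT 10;
-- dead tuples and last autovacuum, in case table bloat is involved
SELECT relname, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;
EOF

Is that reasonable, or is there something PBS-specific I should be querying instead?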
Thank you in advance for your time.