Issue with Email Sending in PBS Post-Failover

I’m encountering a strange issue with email notifications in a PBS (Portable Batch System) setup configured as a high-availability (HA) cluster with two nodes: a primary and a secondary. Both nodes have Postfix installed and configured, and email sending from the command line works perfectly on both nodes using tools like mailx.

The issue arises during a failover. For instance, if the primary node crashes and the system switches to the secondary node, email notifications cease to be sent. There are no email-related entries in the Postfix logs during this time, even though Postfix itself is operational (command-line email sending still works and is logged correctly).

Additionally, if I restart the PBS service on the primary node to force a failover and ensure the scheduler runs on the primary, the problem persists: emails are still not sent, and there is no trace of any communication with the SMTP server in the logs.

Has anyone experienced similar issues or have any ideas on what might be causing this problem?

Thank you in advance for your assistance!

After some investigation, I found the following entries in the server log:

07/13/2024 11:06:58;0001;Server@XXX01;Svr;Server@XXX01;Cannot allocate memory (12) in svr_mailowner_id, fork failed
07/13/2024 11:31:51;0001;Server@XXX01;Svr;Server@XXX01;Cannot allocate memory (12) in svr_mailowner_id, fork failed
07/13/2024 11:31:53;0001;Server@XXX01;Svr;Server@XXX01;Cannot allocate memory (12) in svr_mailowner_id, fork failed

The problem is the failing fork() call in svr_mailowner_id(), as shown in https://github.com/openpbs/openpbs/blob/81187aeceee8247a1fe82ee7f6c89a3987c1ff42/src/server/svr_mail.c. Because the fork() fails before the child process can exec the mailer, no message is ever handed to Postfix, which explains the empty Postfix logs.
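
For context, pbs_server delivers mail by exec()ing a sendmail-compatible binary rather than speaking SMTP directly, so the same delivery path can be exercised by hand to confirm Postfix itself is healthy (the binary path and the addresses below are placeholders, not taken from this setup):

# Exercise the sendmail delivery path pbs_server uses (addresses are placeholders)
printf 'Subject: PBS mail-path test\n\ntest body\n' | /usr/sbin/sendmail -f pbs@example.com admin@example.com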

Here is the current status of the pbs_server process:

cat /proc/815639/status
Name:	pbs_server.bin
Umask:	0022
State:	S (sleeping)
Tgid:	815639
Ngid:	0
Pid:	815639
PPid:	1
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	64
Groups:	0 
NStgid:	815639
NSpid:	815639
NSpgid:	815639
NSsid:	815639
VmPeak:	2653792 kB
VmSize:	2653792 kB
VmLck:	      0 kB
VmPin:	      0 kB
VmHWM:	2433696 kB
VmRSS:	2433692 kB
RssAnon:	2423584 kB
RssFile:	  10108 kB
RssShmem:	      0 kB
VmData:	2439804 kB
VmStk:	    132 kB
VmExe:	   1396 kB
VmLib:	  47676 kB
VmPTE:	   4996 kB
VmSwap:	      0 kB
HugetlbPages:	      0 kB
CoreDumping:	0
THP_enabled:	1
Threads:	2
SigQ:	4/22430
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000001001a00
SigCgt:	0000000180014003
CapInh:	0000000000000000
CapPrm:	000001ffffffffff
CapEff:	000001ffffffffff
CapBnd:	000001ffffffffff
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	0
Speculation_Store_Bypass:	thread vulnerable
Cpus_allowed:	f
Cpus_allowed_list:	0-3
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	2306044
nonvoluntary_ctxt_switches:	908849

This is the memory usage on the server:

free -h
              total        used        free      shared  buff/cache   available
Mem:          5.5Gi       2.7Gi       460Mi       423Mi       2.3Gi       2.1Gi
Swap:            0B          0B          0B

It appears there isn’t enough memory available to ‘copy’ the process for the new child. But why is such a large amount of memory required at all? As far as I know, the Linux kernel uses a copy-on-write (COW) strategy for forked processes, so fork() should not have to duplicate the ~2.4 GiB of resident memory.
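
If I understand the kernel behavior correctly, COW only defers copying the page contents; fork() must still pass the kernel’s commit-accounting check, and with no swap configured the headroom for duplicating a ~2.4 GiB address space is tight. A minimal check, assuming a stock Linux kernel:

# Overcommit policy: 0 = heuristic, 1 = always allow, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory
# Under strict accounting, fork() fails once Committed_AS would exceed CommitLimit
grep -E 'CommitLimit|Committed_AS' /proc/meminfo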

Hi @matzmz!

By the way, the memory consumption depends on how large your infrastructure is. For example, we have 700 nodes and currently 30k jobs in the infrastructure, and after three days pbs_server.bin occupies 1.8 GB of memory. Please be aware that jobs kept in history occupy memory as well.
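
If it helps, here is a quick way to watch the server’s resident size grow over time (assuming the process is named pbs_server.bin, as in your status dump):

# Sample the resident set size of pbs_server.bin every 10 minutes
while sleep 600; do
    printf '%s %s\n' "$(date +%F_%T)" "$(grep VmRSS /proc/$(pgrep -x pbs_server.bin)/status)"
done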

To mitigate the issue, you can consider the following:

  • Reduce the history retention via the server attribute job_history_duration if history is enabled (job_history_enable). You can also disable history altogether, but then jobs are removed from PBS as soon as they complete; see the qmgr sketch after this list.
  • If you use hooks, memory consumption can be reduced by restarting the embedded Python interpreter regularly, since the interpreter has some leaks in it. Please see the manual, chapter 4.3 Restarting the Python Interpreter (HG-23); you can adjust the attributes described in that chapter, but I think the default settings are quite OK.
  • Add some physical memory to your server.
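
A sketch of the qmgr commands for the first option (the attribute names are standard; the 24-hour value is only an example):

# Check whether job history is enabled and how long it is kept
qmgr -c "list server" | grep job_history
# Shorten the retention window, e.g. to 24 hours
qmgr -c "set server job_history_duration = 24:00:00"
# Or disable history entirely (completed jobs are then purged immediately)
qmgr -c "set server job_history_enable = False"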

Vaclav

Hi @vchlum,

First of all, thank you for the detailed response.

Out of curiosity: why does the job history reside in memory? Why is it not offloaded to the database and retrieved on demand?

Thanks