Output of the log file seems suspicious

Hi All,
My system seems to be running OK, but the mom_log output below makes me think that something may be wrong. All the slave nodes show similar output.

I would appreciate it if someone could take a look and tell me whether this output is OK or whether there is something to worry about.

03/31/2022 23:46:31;0004;pbs_mom;Fil;2906.hep-node0.ER;lost connection
03/31/2022 23:46:31;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in is_child_path, Failed to allocate memory
03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;Job files not copied:---->>>>
03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;Unable to copy file /var/spool/pbs/spool/2906.hep-node0.ER to comcomproxy1.com.com:/dev/null

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;>>> error from copy

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;comcomproxy1.com.com: Connection timed out

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;proxy1.com.com, user ali_0, command scp -v -r -p -t /dev/null

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;debug1: Reading configuration data /etc/ssh/ssh_config

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;debug1: /etc/ssh/ssh_config line 61: Applying options for *

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;debug1: Connecting to comcomproxy1.com.com [45.11.57.36] port 22.

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;debug1: connect to address 45.11.57.36 port 22: Connection timed out

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;ssh: connect to host comcomproxy1.com.com port 22: Connection timed out

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;lost connection

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;>>> end error output

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;Output retained on that host in: /var/spool/pbs/undelivered/2906.hep-node0.ER

03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;---->>>>
03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;staged 2 items out over 0:09:02
03/31/2022 23:46:31;0008;pbs_mom;Job;2906.hep-node0;no active tasks
03/31/2022 23:46:31;0008;pbs_mom;Job;2921.hep-node0;no active tasks
03/31/2022 23:46:31;0080;pbs_mom;Req;req_reject;Reject reply code=15051, aux=0, type=54, from root@192.168.1.1:15001
03/31/2022 23:46:31;0100;pbs_mom;Job;2906.hep-node0;Obit sent
03/31/2022 23:46:31;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
03/31/2022 23:46:31;0080;pbs_mom;Job;2906.hep-node0;delete job request received
03/31/2022 23:46:31;0008;pbs_mom;Job;2906.hep-node0;kill_job
03/31/2022 23:46:31;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
03/31/2022 23:46:31;0080;pbs_mom;Job;2906.hep-node0;delete job request received
03/31/2022 23:46:31;0080;pbs_mom;Req;req_reject;Reject reply code=15001, aux=0, type=6, from root@192.168.1.1:15001
03/31/2022 23:46:31;0008;pbs_mom;Job;2921.hep-node0;no active tasks
03/31/2022 23:46:31;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/31/2022 23:46:31;0100;pbs_mom;Req;;Type 3 request received from root@192.168.1.1:15001, sock=1
03/31/2022 23:46:31;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/31/2022 23:46:31;0008;pbs_mom;Job;2956.hep-node0;Started, pid = 26231
03/31/2022 23:47:58;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp /var/spool/pbs/spool/2921.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=2
03/31/2022 23:50:16;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/2921.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=3
03/31/2022 23:50:42;0008;pbs_mom;Job;2921.hep-node0;no active tasks
03/31/2022 23:50:42;0080;pbs_mom;Job;2956.hep-node0;task 00000001 terminated
03/31/2022 23:50:42;0008;pbs_mom;Job;2956.hep-node0;Terminated
03/31/2022 23:50:42;0100;pbs_mom;Job;2956.hep-node0;task 00000001 cput=00:04:11
03/31/2022 23:50:42;0008;pbs_mom;Job;2956.hep-node0;kill_job
03/31/2022 23:50:42;0100;pbs_mom;Job;2956.hep-node0;hep-node0 cput=00:04:11 mem=12708kb
03/31/2022 23:50:42;0100;pbs_mom;Job;2956.hep-node0;Obit sent
03/31/2022 23:50:42;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
03/31/2022 23:50:42;0080;pbs_mom;Job;2956.hep-node0;copy file request received
03/31/2022 23:52:24;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp /var/spool/pbs/spool/2921.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=4
03/31/2022 23:52:45;0001;pbs_mom;Fil;copy_file;sys_copy failed with status=1
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;Unable to copy file /var/spool/pbs/spool/2921.hep-node0.OU to comcomproxy1.com.com:/dev/null
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;comcomproxy1.com.com: Connection timed out
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;proxy1.com.com, user ali_0, command scp -v -r -p -t /dev/null
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;debug1: Reading configuration data /etc/ssh/ssh_config
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;debug1: /etc/ssh/ssh_config line 61: Applying options for *
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;debug1: Connecting to comcomproxy1.com.com [45.11.57.36] port 22.
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;debug1: connect to address 45.11.57.36 port 22: Connection timed out
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;ssh: connect to host comcomproxy1.com.com port 22: Connection timed out
03/31/2022 23:52:45;0004;pbs_mom;Fil;2921.hep-node0.OU;lost connection
03/31/2022 23:52:45;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in is_child_path, Failed to allocate memory
03/31/2022 23:52:50;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/2956.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=1
03/31/2022 23:54:52;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/2921.hep-node0.ER ali_0@comcomproxy1.com.com:/dev/null status=1, try=1
03/31/2022 23:54:57;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp /var/spool/pbs/spool/2956.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=2
03/31/2022 23:56:59;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp /var/spool/pbs/spool/2921.hep-node0.ER ali_0@comcomproxy1.com.com:/dev/null status=1, try=2
03/31/2022 23:57:15;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/2956.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=3
03/31/2022 23:57:23;0008;pbs_mom;Job;2921.hep-node0;no active tasks
03/31/2022 23:57:23;0008;pbs_mom;Job;2956.hep-node0;no active tasks
03/31/2022 23:57:23;0080;pbs_mom;Job;2937.hep-node0;task 00000001 terminated
03/31/2022 23:57:23;0008;pbs_mom;Job;2937.hep-node0;Terminated
03/31/2022 23:57:23;0100;pbs_mom;Job;2937.hep-node0;task 00000001 cput=00:23:19
03/31/2022 23:57:23;0008;pbs_mom;Job;2937.hep-node0;kill_job
03/31/2022 23:57:23;0100;pbs_mom;Job;2937.hep-node0;hep-node0 cput=00:23:19 mem=12944kb
03/31/2022 23:57:23;0100;pbs_mom;Job;2937.hep-node0;Obit sent
03/31/2022 23:57:23;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
03/31/2022 23:57:23;0080;pbs_mom;Job;2937.hep-node0;copy file request received
03/31/2022 23:59:17;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/2921.hep-node0.ER ali_0@comcomproxy1.com.com:/dev/null status=1, try=3
03/31/2022 23:59:23;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp /var/spool/pbs/spool/2956.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=4
03/31/2022 23:59:30;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/2937.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=1
03/31/2022 23:59:44;0001;pbs_mom;Fil;copy_file;sys_copy failed with status=1
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;Unable to copy file /var/spool/pbs/spool/2956.hep-node0.OU to comcomproxy1.com.com:/dev/null
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;comcomproxy1.com.com: Connection timed out
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;proxy1.com.com, user ali_0, command scp -v -r -p -t /dev/null
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;debug1: Reading configuration data /etc/ssh/ssh_config
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;debug1: /etc/ssh/ssh_config line 61: Applying options for *
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;debug1: Connecting to comcomproxy1.com.com [45.11.57.36] port 22.
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;debug1: connect to address 45.11.57.36 port 22: Connection timed out
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;ssh: connect to host comcomproxy1.com.com port 22: Connection timed out
03/31/2022 23:59:44;0004;pbs_mom;Fil;2956.hep-node0.OU;lost connection
03/31/2022 23:59:44;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in is_child_path, Failed to allocate memory

  1. Make sure passwordless ssh/scp works for all users, without prompting for a password or for host key confirmation.
  2. Edit /etc/pbs.conf on the PBS server host and the compute node(s) and add these lines, in this order:
    PBS_RCP=/bin/false
    PBS_SCP=/bin/scp
    Remove any duplicate entries, then restart the PBS services.
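
In case it helps, here is a rough sketch of how both points could be checked from a shell. The function names `check_ssh` and `check_pbs_conf` are made up for this illustration and are not part of OpenPBS. The order check simply mirrors the advice above.

```shell
# Hypothetical helpers for illustration only, not part of OpenPBS.

# Point 1: succeed only if ssh to user@host works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password or
# for host key confirmation.
check_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# Point 2: succeed only if the given pbs.conf has exactly one PBS_RCP
# entry and one PBS_SCP entry, with PBS_RCP listed first.
check_pbs_conf() {
    conf=$1
    rcp_count=$(grep -c '^PBS_RCP=' "$conf")
    scp_count=$(grep -c '^PBS_SCP=' "$conf")
    [ "$rcp_count" -eq 1 ] && [ "$scp_count" -eq 1 ] || return 1
    rcp_line=$(grep -n '^PBS_RCP=' "$conf" | cut -d: -f1)
    scp_line=$(grep -n '^PBS_SCP=' "$conf" | cut -d: -f1)
    [ "$rcp_line" -lt "$scp_line" ]
}
```

You could then run, for example, `check_ssh ali_0@hep-node1 && echo "ssh ok"` as each user and `check_pbs_conf /etc/pbs.conf && echo "pbs.conf ok"` on each host. The restart itself is typically `systemctl restart pbs` on systemd-based installs or `/etc/init.d/pbs restart` otherwise, depending on how PBS was installed.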

Thank you @adarsh. I noticed that one of the nodes had a problem with scp. Currently, /etc/pbs.conf contains only the last line you listed above (PBS_SCP), and everything seems to work fine so far. Do I still need to add the first line (PBS_RCP)?

Thank you. It is better to have
PBS_RCP=/bin/false above the PBS_SCP line, so that OpenPBS does not try to use rcp.

Well, I am seeing the same output in the log files again, even though all nodes can communicate via ssh without a password. I have also placed the two lines above, in the same order, in the pbs.conf file.

04/06/2022 12:51:25;0080;pbs_mom;Job;4048.hep-node0;copy file request received
04/06/2022 12:51:25;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/4048.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=1
04/06/2022 12:51:25;0080;pbs_mom;Fil;sys_copy;command: /bin/false -rp /var/spool/pbs/spool/4048.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=2
04/06/2022 12:51:36;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/4048.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=3
04/06/2022 12:51:36;0080;pbs_mom;Fil;sys_copy;command: /bin/false -rp /var/spool/pbs/spool/4048.hep-node0.OU ali_0@comcomproxy1.com.com:/dev/null status=1, try=4
04/06/2022 12:51:57;0001;pbs_mom;Fil;copy_file;sys_copy failed with status=1
04/06/2022 12:51:57;0004;pbs_mom;Fil;4048.hep-node0.OU;Unable to copy file /var/spool/pbs/spool/4048.hep-node0.OU to comcomproxy1.com.com:/dev/null
04/06/2022 12:51:57;0004;pbs_mom;Fil;4048.hep-node0.OU;Executing: program /usr/bin/ssh host comcomproxy1.com.com, user ali_0, command scp -v -r -p -t /dev/null
04/06/2022 12:51:57;0004;pbs_mom;Fil;4048.hep-node0.OU;OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
04/06/2022 12:51:57;0004;pbs_mom;Fil;4048.hep-node0.OU;debug1: Reading configuration data /etc/ssh/ssh_config
04/06/2022 12:51:57;0004;pbs_mom;Fil;4048.hep-node0.OU;debug1: /etc/ssh/ssh_config line 60: Applying options for *
04/06/2022 12:51:57;0004;pbs_mom;Fil;4048.hep-node0.OU;ssh: Could not resolve hostname comcomproxy1.com.com: Name or service not known
04/06/2022 12:51:57;0004;pbs_mom;Fil;4048.hep-node0.OU;lost connection
04/06/2022 12:51:57;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in is_child_path, Failed to allocate memory
04/06/2022 12:51:57;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/4048.hep-node0.ER ali_0@comcomproxy1.com.com:/dev/null status=1, try=1
04/06/2022 12:51:57;0080;pbs_mom;Fil;sys_copy;command: /bin/false -rp /var/spool/pbs/spool/4048.hep-node0.ER ali_0@comcomproxy1.com.com:/dev/null status=1, try=2
04/06/2022 12:52:08;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/4048.hep-node0.ER ali_0@comcomproxy1.com.com:/dev/null status=1, try=3
04/06/2022 12:52:08;0080;pbs_mom;Fil;sys_copy;command: /bin/false -rp /var/spool/pbs/spool/4048.hep-node0.ER ali_0@comcomproxy1.com.com:/dev/null status=1, try=4

Please check this
Please also share your /etc/pbs.conf and /etc/hosts files.
Make sure the hostname command returns the correct hostname.
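
To see at a glance which remote destinations the failed copies are aimed at, something like the following could be run against the mom log. This is only a sketch: `failed_copy_targets` is a made-up name, and the parsing assumes the `sys_copy;command: ... status=N, try=M` line layout seen in the logs posted in this thread.

```shell
# Hypothetical helper: print each distinct destination (user@host:path)
# of a sys_copy attempt that ended with a non-zero status, preceded by
# the number of failed attempts logged for it.
failed_copy_targets() {
    awk -F';' '
        $5 == "sys_copy" && $6 ~ /status=[1-9]/ {
            n = split($6, w, " ")
            # the destination is the word just before "status=N,"
            for (i = 2; i <= n; i++)
                if (w[i] ~ /^status=/) { count[w[i-1]]++; break }
        }
        END { for (d in count) print count[d], d }
    ' "$1"
}
```

For example, `failed_copy_targets /var/spool/pbs/mom_logs/20220406` (assuming the usual one-file-per-day naming under PBS_HOME/mom_logs). Each host printed there is one the compute node must be able to resolve and reach over ssh.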

Below is the pbs.conf file from the master node. The last two lines are the same across all the nodes.
PBS_EXEC=/opt/pbs
PBS_SERVER=hep-node0
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_RCP=/bin/false
PBS_SCP=/bin/scp

The hosts files are also identical, with entries in the same order, across all the nodes.
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.107 hep-node7 ali_0
192.168.1.106 hep-node6 ali_0
192.168.1.1 hep-node0 ali_0
192.168.1.102 hep-node2 ali_0
192.168.1.105 hep-node5 ali_0
192.168.1.104 hep-node4 ali_0
192.168.1.101 hep-node1 ali_0
192.168.1.103 hep-node3 ali_0

The alias ali_0 is attached to every one of the IPs; could you remove it?
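
If it is useful, a small check along these lines would list any name that is attached to more than one address in a hosts file. It is only a sketch, and `shared_host_names` is a made-up name; loopback entries are skipped because `localhost` normally appears on both 127.0.0.1 and ::1.

```shell
# Hypothetical helper: list every name in a hosts-format file that is
# attached to more than one IP address, followed by those addresses.
shared_host_names() {
    awk '
        { sub(/#.*/, "") }                       # drop comments
        $1 ~ /^127\./ || $1 == "::1" { next }    # skip loopback entries
        NF >= 2 {
            for (i = 2; i <= NF; i++)
                if (!seen[$i, $1]++) {           # count each name/IP pair once
                    ips[$i] = ips[$i] " " $1
                    n[$i]++
                }
        }
        END { for (name in n) if (n[name] > 1) print name ":" ips[name] }
    ' "$1"
}
```

Running `shared_host_names /etc/hosts` on a node should print nothing once the alias has been removed.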

Also, please check your /etc/ssh/ssh_config file with regard to the message below