MOM Config incomplete

Hi All,

I'm a little stuck, or I have confused myself completely, with a brand new issue that is different from my previous ones!

So I have the PBS server running and all is well. The config (pbs.conf output) is:

PBS_SERVER=testserver.com
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1 (edit: 0 to 1)
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=56
PBS_SCP=/bin/scp

Now I have the execution node set up and running (as in the service is green, nothing more :unamused:), and here's my config in mom_priv:

$clienthost server head
$restrict_user_maxsysid 999
$clienthost serverhead.test.com

And that's it. Is this correct? I'm getting no response when I run pbsnodes -a;
it tells me that the node is down.

netstat -n | grep 15003 gives me nothing, or sometimes a TIME_WAIT.

I know I'm missing something simple, but I just can't see the wood for the trees!

Cheers
Timbo

PBS_SERVER should be set to the hostname of your server. You have it set to "server head".

PBS_CORE_LIMIT should be set to the maximum allowed size for core files, or the string "unlimited" (without quotes). For example, PBS_CORE_LIMIT=536870912 sets the limit to 512 MB.
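
The byte value in that example is simply 512 x 1024 x 1024. A small Python sketch of the conversion, in case a different size is wanted (the helper name `core_limit_bytes` is made up for illustration and is not part of PBS):

```python
def core_limit_bytes(megabytes: int) -> int:
    """Convert a size in MB (taking 1 MB as 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


# Build a line that could be pasted into pbs.conf.
print(f"PBS_CORE_LIMIT={core_limit_bytes(512)}")  # prints PBS_CORE_LIMIT=536870912
```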

$clienthost should be set to the hostname of the server. Again, you have specified "server head".

$restrict_user_maxsysid defaults to 999. It doesn't harm anything to specify the default, but it isn't necessary.
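
The pbs.conf points in this reply could be checked mechanically. The following Python sketch is only an illustration: the function name `check_pbs_conf` is hypothetical, and the two rules it applies (PBS_SERVER must be a single hostname with no spaces, PBS_CORE_LIMIT must be a number or `unlimited`) are taken from the advice above rather than from any PBS tool.

```python
def check_pbs_conf(text: str) -> list:
    """Return a list of human-readable problems found in pbs.conf-style text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not KEY=VALUE.
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()

    problems = []

    # A hostname cannot contain whitespace, so "server head" is flagged.
    server = settings.get("PBS_SERVER", "")
    if not server or any(ch.isspace() for ch in server):
        problems.append(f"PBS_SERVER should be a single hostname, got {server!r}")

    # Accept either a plain number or the literal string "unlimited".
    core = settings.get("PBS_CORE_LIMIT")
    if core is not None and core != "unlimited" and not core.isdigit():
        problems.append(
            f"PBS_CORE_LIMIT should be a number or 'unlimited', got {core!r}"
        )
    return problems


sample = "PBS_SERVER=server head\nPBS_CORE_LIMIT=unlimited\n"
for problem in check_pbs_conf(sample):
    print(problem)
```

With the sample above, only the PBS_SERVER line is reported, since "unlimited" is an accepted core limit.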

Sorry, the name "server head" really means testserver.com.

In the MoM config, is $clienthost the execution node's name or the server's name? I already have PBS_SERVER=testserver.com in pbs.conf.

Will change the core limit.

I'm just struggling to get the execution node to speak to the head server node (testserver.com).

Are there messages in the MoM log file indicating what the problem may be?

PBS_START_MOM=0 should be PBS_START_MOM=1 in the pbs.conf file.

Sorry, that is set to "1"; I must have mistyped it.

01/19/2017 09:05:35;0002;pbs_mom;Svr;Log;Log opened
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;pbs_version=14.0.1
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
01/19/2017 09:05:35;0100;pbs_mom;Svr;parse_config;file config
01/19/2017 09:05:35;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
01/19/2017 09:05:35;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP set to use reserved port authentication
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 1024
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Max files too low - you may want to increase it.
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Single pbs_comm configured, TPP Fault tolerant mode disabled
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm testserver.COM
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.20.2 as authorized
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.20.1 as authorized
01/19/2017 09:05:35;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
01/19/2017 09:05:35;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
01/19/2017 09:05:35;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:35;0002;pbs_mom;n/a;ncpus;hyperthreading enabled
01/19/2017 09:05:35;0002;pbs_mom;n/a;initialize;pcpus=56, OS reports 56 cpu(s)
01/19/2017 09:05:35;0006;pbs_mom;Fil;pbs_mom;Version 14.0.1, started, initialization type = 0
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Mom pid = 18482 ready, using ports Server:15001 MOM:15002 RM:15003
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:35;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:35;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:37;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:37;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:37;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
01/19/2017 09:05:39;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:39;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:39;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:39;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:39;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:39;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
01/19/2017 09:05:41;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:41;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:41;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:41;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
01/19/2017 09:05:43;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.20.1:15003 to pbs_comm
01/19/2017 09:05:43;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM
01/19/2017 09:05:43;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/19/2017 09:05:43;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at testserver.COM:15001
01/19/2017 09:05:43;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down
01/19/2017 09:05:43;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called

Sorry, I should have posted this earlier, but it's a task to get anything off an air-gapped network.

testserver.com is the head node. The above log is from testmom, which is set up on a separate node for execution only.
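
The log above repeats the same cycle every two seconds: the MoM connects to pbs_comm and the connection goes down again within the same second. A rough Python sketch for counting those cycles in a MoM log follows. The function name `count_comm_flaps` is hypothetical, and the matched message text is copied from the log lines in this post, so it may need adjusting for other PBS versions.

```python
def count_comm_flaps(log_text: str) -> dict:
    """Count pbs_comm connect and disconnect messages in MoM log text."""
    counts = {"connected": 0, "down": 0}
    for line in log_text.splitlines():
        # Log fields are separated by ';' and the message is the last field.
        message = line.rsplit(";", 1)[-1]
        if message.startswith("Connected to pbs_comm"):
            counts["connected"] += 1
        elif message.startswith("Connection to pbs_comm") and message.endswith("down"):
            counts["down"] += 1
    return counts


sample = (
    "01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm testserver.COM\n"
    "01/19/2017 09:05:37;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm testserver.COM down\n"
)
print(count_comm_flaps(sample))
```

If the two counts stay equal as the log grows, every connection that is established is also being dropped, which matches the "node is down" state reported by pbsnodes -a.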

Could you please check whether

  1. there is an entry for testserver.com in the /etc/hosts file on the hosts.
  2. the firewall is blocking the communication.

  1. Yes, it's in the hosts file.
  2. Firewall is off.
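
Both points (name resolution and an open path through any firewall) can be tested together from the execution node by attempting a TCP connection. This is a sketch, not a PBS tool: `can_connect` is a made-up helper, and which port to probe is an assumption left to the reader.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if host resolves and a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers resolution failures (socket.gaierror), refusals, and timeouts.
        return False
```

For example, calling `can_connect("testserver.com", 15001)` on the execution node probes the "Server" port that appears in the MoM log earlier in this thread. A result of False does not say which of the two checks failed, only that at least one did.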

I am just wondering why the "com" in testserver.com is capitalized in the MoM logs...

It's been edited; it should be lowercase.