Compute nodes down - PBS (inactive dead)

Dear Team,
Recently, pbsnodes -a has been showing the compute nodes as down.
I ssh'd into one of those nodes, and systemctl status pbs shows inactive (dead).

I restarted the machine and also ran systemctl restart pbs. The compute nodes came back online, but after a few minutes pbsnodes -a showed them as down again.
Each time I run systemctl restart pbs, the nodes come online for a few minutes and then go down again.

Please advise.

On the compute node, please check

  • $PBS_HOME/mom_logs/ # mom logs for the day it happened
  • /var/log/messages # same here
  • you might have to run pbs_mom under strace to find out the reason (or share the output so the community can give feedback)
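
The first check above can be sketched as a small helper. This is only a sketch: it assumes the MoM writes one log file per day, named YYYYMMDD, under $PBS_HOME/mom_logs, and the function names mom_log_path and show_mom_log are made up for illustration.

```shell
#!/bin/sh
# Sketch: locate the MoM log for a given day and show its last lines.
# Assumption: daily log files named YYYYMMDD live under $PBS_HOME/mom_logs.

# mom_log_path PBS_HOME [YYYYMMDD]
# Prints the expected log file path; the day defaults to today.
mom_log_path() {
    home=$1
    day=${2:-$(date +%Y%m%d)}
    printf '%s/mom_logs/%s\n' "$home" "$day"
}

# show_mom_log PBS_HOME [YYYYMMDD] [LINES]
# Prints the last LINES lines (default 50) of that log, or fails if it is missing.
show_mom_log() {
    log=$(mom_log_path "$1" "$2")
    if [ -r "$log" ]; then
        tail -n "${3:-50}" "$log"
    else
        echo "no readable log at $log" >&2
        return 1
    fi
}
```

On the compute node you would typically run `. /etc/pbs.conf` first and then `show_mom_log "$PBS_HOME"`. For the system log, something like `grep -i pbs /var/log/messages | tail -n 50` narrows the output to PBS-related lines.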

@adarsh
Here is the log from /var/log/messages

Also, can you show me how to run pbs_mom under strace?

Try reinstalling pbs_mom on the compute node and check whether the problem recurs.
Otherwise, on the compute node:

  1. stop the pbs services
  2. ps -ef | grep pbs_mom # make sure no pbs_mom process is running
  3. source /etc/pbs.conf
  4. strace -o pbs_mom.txt $PBS_EXEC/sbin/pbs_mom
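
The steps above could be wrapped in a small script along these lines. Treat it as a sketch rather than a tested procedure: find_mom and trace_mom are hypothetical helper names, the script looks for pbs_mom under both sbin and bin because the location can differ between installs, and the -f flag is added on the assumption that pbs_mom forks into the background, so child processes need to be followed.

```shell
#!/bin/sh
# Sketch: run pbs_mom under strace after checking that no MoM is already running.

# find_mom PBS_EXEC
# Prints the path of the pbs_mom binary, checking sbin first and then bin.
find_mom() {
    for d in sbin bin; do
        if [ -x "$1/$d/pbs_mom" ]; then
            printf '%s\n' "$1/$d/pbs_mom"
            return 0
        fi
    done
    return 1
}

# trace_mom [CONF] [OUTFILE]
# Reads PBS_EXEC from CONF (default /etc/pbs.conf) and writes the trace to OUTFILE.
trace_mom() {
    conf=${1:-/etc/pbs.conf}
    out=${2:-pbs_mom.txt}
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    . "$conf"
    mom=$(find_mom "$PBS_EXEC") || { echo "pbs_mom not found under $PBS_EXEC" >&2; return 1; }
    # Refuse to start a second MoM; stop the pbs service first.
    if pgrep -x pbs_mom >/dev/null 2>&1; then
        echo "pbs_mom is still running; stop the pbs service first" >&2
        return 1
    fi
    # -f follows forked children, -o writes the trace to a file.
    strace -f -o "$out" "$mom"
}
```

Run `trace_mom` as root on the compute node, wait until the node goes down again, and then look at the end of pbs_mom.txt for the last system calls before the process exited.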

Hi @adarsh. I reinstalled PBS, and now the compute nodes are online.

I will monitor for a while to see how it goes.

Thanks
