Hi All,
I accidentally submitted 1000 job arrays of length 1000 (big mistake) on our group cluster. When I noticed this, the scheduler seemed to be overloaded, and I could not even run qstat. The pbs_server process was using nearly 80% of the CPU. I tried several methods to remove all the jobs, but all of them failed. My group manager stopped the process by killing pbs_server with root access. Our problem now is that after restarting the compute cluster, qstat gives the following error even though pbs_server is running (we restarted that too):
qstat: cannot connect to server XXX (errno=111) Connection refused
qstat: Error (111 - Connection refused)
Neither my group leader nor I am a professional cluster manager, so my question is: what could be wrong here? Should we reinstall PBS and Torque to solve this? Also, did killing pbs_server remove all the submitted jobs?
Our system is Rocks 7.0 (Manzanita).
Thanks for any help or suggestions!
Hi,
Please check the following:
- Confirm that the pbs_server service is up and running.
- Run the qstat command under strace to find out why the connection fails.
- Check that the firewall, SELinux, and DNS settings are intact and working.
- Check the pbs_server logs.
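The connectivity part of these checks can be sketched in a few lines of Python. This is only a rough illustration, not an official PBS tool: the function name `probe_port` is made up for this example, and port 15001 is assumed to be the pbs_server port (it is the usual TORQUE default, but your installation may differ). It tries a plain TCP connection from the machine where qstat fails and reports whether the port is open, refused (the same condition as errno=111), timing out, or unresolvable.

```python
import errno
import socket
import sys

# Assumption: 15001 is the usual default pbs_server port in TORQUE.
# Check your own configuration before relying on it.
DEFAULT_PBS_PORT = 15001


def probe_port(host, port=DEFAULT_PBS_PORT, timeout=3.0):
    """Try a TCP connection to host:port and return a short status label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Same condition that qstat reports as errno=111 on Linux.
        return "refused"
    except socket.timeout:
        return "timeout"
    except socket.gaierror:
        # The host name could not be resolved.
        return "dns-failure"
    except OSError as exc:
        return "error: " + errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    # Usage: python probe_port.py <server-name> [port]
    if len(sys.argv) < 2:
        sys.exit("usage: probe_port.py <server-name> [port]")
    target = sys.argv[1]
    target_port = int(sys.argv[2]) if len(sys.argv) > 2 else DEFAULT_PBS_PORT
    print(target, target_port, probe_port(target, target_port))
```

If this prints "refused", most likely nothing is listening on that port at the address the server name resolves to, so the pbs_server logs and the name the server is bound to are the next things to look at. A "timeout" more often points to a firewall dropping packets, and "dns-failure" to name resolution.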
Please also refer to these threads:
"Pbsnodes: cannot connect to server , error=111" and "Failed to start PBS dataservice"
Thank you