Restart pbs due to submitting too many jobs

tom · November 11, 2024, 8:48pm

Hi All,
I accidentally submitted 1000 job arrays of length 1000 (big mistake) on our group cluster. When I noticed this, the scheduler seemed to be overloaded, and I could not even run qstat. The pbs_server process was using nearly 80% of the CPU. I tried several methods to remove all the jobs, but all of them failed. My group manager stopped the process by killing pbs_server with root access. Our problem now is that after restarting the compute cluster, qstat gives the following error even though pbs_server is running (we restarted that too):
qstat: cannot connect to server XXX (errno=111) Connection refused
qstat: Error (111 - Connection refused)
Neither my group leader nor I am a professional cluster manager, so my question is: what could be wrong here? Should we reinstall PBS and Torque to solve this? Also, does killing pbs_server remove all the submitted jobs?
Our system is Rocks 7.0 (Manzanita).
Thanks for any help or suggestions!

adarsh · November 12, 2024, 7:40am

Hi,

Please check the following:

Confirm that the pbs_server service is up and running.
Run qstat under strace to find out why the connection fails.
Confirm that the firewall, SELinux, and DNS are all configured correctly and working.
Check the pbs_server logs.
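
As a quick way to separate "nothing is listening" from a DNS or firewall problem, a small TCP probe can help. This is only a rough sketch, not an official PBS tool: the function name probe_tcp is made up for illustration, and the port 15001 is assumed to be the default pbs_server port, so check your own configuration if the site uses a different one. On Linux, errno 111 is ECONNREFUSED, which usually means the host was reached but no process accepted the connection on that port.

```python
import socket

# Assumption: 15001 is the default pbs_server port; adjust if your site differs.
PBS_SERVER_PORT = 15001


def probe_tcp(host, port=PBS_SERVER_PORT, timeout=3.0):
    """Attempt a TCP connection and return (ok, message) describing the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"{host}:{port} accepted the connection"
    except ConnectionRefusedError as exc:
        # Host is reachable, but nothing is listening on the port
        # (or a firewall actively rejects the connection).
        return False, f"{host}:{port} refused the connection (errno={exc.errno})"
    except socket.timeout:
        # No answer at all: often a firewall silently dropping packets.
        return False, f"{host}:{port} timed out after {timeout}s"
    except socket.gaierror as exc:
        # The hostname could not be resolved: points at DNS or /etc/hosts.
        return False, f"cannot resolve {host}: {exc}"
    except OSError as exc:
        # Any other network error, e.g. no route to host.
        return False, f"{host}:{port} failed: {exc}"
```

Calling print(probe_tcp("XXX")) with the server name from the qstat error in place of XXX gives a first hint: a refusal suggests the server is not listening where the client expects it, a timeout points toward the firewall, and a resolution error points toward DNS.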

Please also refer to these topics:
"Pbsnodes: cannot connect to server, error=111" and "Failed to start PBS dataservice"

Thank you
