Qsub/qstat slow (or failing) with thousands of jobs submitted

We have a cluster where a user submits many jobs (sometimes hundreds or more at a time) via a Python script. While the script is running, job submission frequently fails, or jobs are submitted but end up in an error state. At times qstat also fails when that many jobs are submitted at once.
The head node is a Dell PowerEdge R630 with 64 GB of RAM and 16 physical CPU cores (not counting hyperthreading). It runs CentOS 7.7.1908, and the OpenPBS version is 19.1.3-100340.
Based on sar output, the head node is not overloaded in terms of CPU, RAM, or disk I/O.
The only component that looks overloaded is pbs_server.bin, which sometimes sits at 100% of a CPU but never goes over, suggesting it's single-threaded.

Are there any configuration changes we need to make to allow thousands of jobs to be submitted quickly? Or do we need to get the user to slow down the rate of job submission?
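In the meantime, one client-side mitigation is to throttle the submission rate in the Python script itself. Here's a minimal sketch of that idea; the `submit_throttled` helper, its `delay` value, and the injectable `submit` callable are my own illustration, not anything from PBS:

```python
import subprocess
import time

def submit_throttled(scripts, submit=None, delay=1.0):
    """Submit job scripts one at a time, pausing between submissions.

    `submit` defaults to shelling out to qsub; pass a stub for testing.
    """
    if submit is None:
        # Each qsub call prints the new job ID on stdout.
        submit = lambda path: subprocess.run(
            ["qsub", path], check=True, capture_output=True, text=True
        ).stdout.strip()
    job_ids = []
    for path in scripts:
        job_ids.append(submit(path))
        time.sleep(delay)  # crude rate limit to avoid hammering pbs_server
    return job_ids
```

A fixed sleep is the bluntest possible limiter; backing off on submission errors would be a natural refinement.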

My apologies if this is addressed somewhere in documentation. I couldn’t locate anything relevant to this issue.

Can you share some logs or more information about the kinds of jobs being submitted?
PBS should be able to stat jobs pretty fast. Which version of PBS are you running?

Some bugs were fixed on master recently, so if you were testing on master, perhaps give it another shot.

Thanks for your reply. It turns out not to have been an issue with PBS itself. It was the combination of two things:

  1. The user runs thousands of jobs every day, so there were around 260,000 jobs in history
  2. The cluster manager (Bright) was running “qstat -x -Fjson” EVERY TWO MINUTES, generating about 1.1 GB of output each time and overloading pbs_server.bin

Bright provided us with patched rpms which fixed the problem, and qstat has been working normally ever since.
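For anyone else who accumulates a large job history: OpenPBS keeps finished jobs for the server's `job_history_duration` (two weeks by default when `job_history_enable` is on), which is what `qstat -x` walks. A sketch of shrinking the retained history via qmgr, run as a PBS manager on the server host; the 24-hour value is just an example, not something from this thread:

```shell
# Inspect the current history settings
qmgr -c "print server job_history_enable"
qmgr -c "print server job_history_duration"

# Retain finished jobs for only 24 hours so `qstat -x` output stays small
qmgr -c "set server job_history_duration = 24:00:00"
```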
