Qsub/qstat slow (or failing) with thousands of jobs submitted

We have a cluster where a user submits many jobs (sometimes hundreds or more) via a Python script. Frequently, while his script is running, job submission fails, or jobs are submitted but then end up in an error state. qstat also fails at times when so many jobs are submitted at once.
The head node is a Dell PowerEdge R630 with 64 GB of RAM and 16 physical CPU cores (discounting hyperthreading). It’s running CentOS 7.7.1908, and the OpenPBS version is 19.1.3-100340.
Based on sar output, the head node is not overloaded in terms of CPU, RAM, or disk I/O.
The only component that looks overloaded is pbs_server.bin, which sometimes climbs to 100% CPU but never beyond it, suggesting it’s single-threaded.
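(In case it helps anyone reproducing this, here is a hedged sketch of how per-process CPU can be watched with pidstat, which ships in the same sysstat package as sar; the 5-second interval is just an example.)

  # Sample pbs_server CPU usage every 5 seconds; a single process pinned
  # near 100% on a multi-core box is consistent with a single-threaded bottleneck.
  pidstat -u -p "$(pgrep -d, -f pbs_server)" 5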

Are there any configuration changes we need to make to allow thousands of jobs to be submitted quickly? Or do we need to get the user to slow down the rate of job submission?
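(Purely to illustrate the second option, not a recommendation: a minimal shell sketch of client-side throttling. The real submitter here is a Python script, and the job_*.sh names and the 0.2-second pause below are made-up placeholders.)

  # Crude rate limit: pause briefly between qsub calls so pbs_server
  # isn't hit with thousands of submissions back to back.
  for script in job_*.sh; do
      qsub "$script"
      sleep 0.2
  done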

My apologies if this is addressed somewhere in documentation. I couldn’t locate anything relevant to this issue.

Can you share some logs or more information about the kind of jobs being submitted?
PBS should be able to stat jobs pretty fast. Which version of PBS are you running?

Some bugs on master were fixed recently, so perhaps you could give it another shot if you were testing against master.

Thanks for your reply. It turns out not to have been an issue with PBS. It was the combination of two things:

  1. The user runs thousands of jobs every day, so there were around 260,000 jobs in history.
  2. The cluster manager (Bright) was running “qstat -x -Fjson” EVERY TWO MINUTES, generating about 1.1 GB of output and overloading pbs_server.bin.

Bright provided us with patched RPMs, which fixed the problem, and qstat has been working normally ever since.
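(For anyone else landing here with a huge job history: a hedged sketch of how retention can be checked and shortened with qmgr. job_history_duration is a standard PBS server attribute, but the 72-hour value below is only an example, and what is appropriate depends on your site.)

  # Show any non-default job history settings.
  qmgr -c "print server" | grep job_history
  # Keep finished jobs for 72 hours instead of the much longer default
  # (requires PBS manager privileges, e.g. run as root on the server host).
  qmgr -c "set server job_history_duration = 72:00:00"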


We are running into what seems like behavior similar to the original poster’s. We have an automated test system that will occasionally submit on the order of 40,000+ jobs in a very short time. Job history typically sits at around 125K jobs.

When this rush of jobs gets queued, qstat/qsub hang, and sometimes the following error is thrown:

pbs_iff: error returned: 15019
pbs_iff: Invalid credential
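(A rough, hedged way to gauge whether accumulated job history is a factor here; qstat -x includes finished jobs, and the count and timing below are only approximate.)

  # Approximate number of jobs (including finished ones) the server is tracking.
  qstat -x | wc -l
  # How long a full historical query takes end to end.
  time qstat -x > /dev/null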

We are running 19.1.1. Subhasis mentioned some recent bug fixes in this area. Any chance you can point me to them? I presume they are rolled up into the 20.0 release?

Thanks!

I’m not sure which particular bug fixes address qsub/qstat hanging, but the latest master has been tested to remain responsive even with a million-plus jobs in the system, so I encourage you to try out master if possible.

Hi @caldodge,
Can you please share the patched RPM with the fix?
We are running into the same issue, and it would be really nice if we could have the fix as well.

Thanks,
Roy

We did not patch the RPM or upgrade PBS Pro. The answer was to disable all Bright monitoring of PBS and also to quadruple the memory in the head node.

Thanks @caldodge, what is Bright monitoring?

The cluster is managed with Bright Cluster Manager. Bright has the ability to track jobs, so users can check on them in a convenient web portal. This is a Bad Idea when users are running tens of thousands of jobs a day (Bright was running “qstat -x -Fjson” every couple of minutes, generating a JSON result larger than 1 gigabyte).
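(For anyone wanting to estimate the same load on their own cluster, a hedged one-liner using the exact command quoted above; wc -c simply reports the size of the JSON output in bytes.)

  # How big and how slow is the query Bright was issuing every couple of minutes?
  time qstat -x -Fjson | wc -c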

This is interesting. Can you share how to access the web page and how to disable it?

It has to be disabled in Bright configuration, rather than in a web page. I will see if I can dig up what we did to achieve that result.