We have a cluster where a user is submitting many jobs (sometimes hundreds or more) via a Python script. Frequently when he's running this script, job submission will fail, or jobs will be submitted but then end up in an error state. At times qstat also fails when so many jobs are submitted at once.
The head node is a Dell PowerEdge R630 with 64 GB of RAM and 16 physical CPU cores (discounting hyperthreading). It's running CentOS 7.7.1908, and the OpenPBS version is 19.1.3-100340.
Based on sar output, the head node is not overloaded from a CPU, RAM, or disk I/O perspective.
The only possibly overloaded component I can identify is pbs_server.bin, which sometimes pegs at 100% CPU but never goes over, suggesting it's single-threaded.
Are there any configuration changes we need to make to allow thousands of jobs to be submitted quickly? Or do we need to get the user to slow down the rate of job submission?
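On the chance that throttling submission turns out to be the answer: a minimal sketch of rate-limited submission, assuming the user's script calls qsub once per job script via subprocess. The function name, the injectable `submit` callable (there for testing without a PBS server), and the rate value are all illustrative, not anything from PBS itself:

```python
import subprocess
import time

def submit_throttled(scripts, max_per_second=5.0, submit=None):
    """Submit job scripts at no more than max_per_second; return job IDs.

    `submit` defaults to invoking qsub, but can be replaced (e.g. for
    testing on a machine without PBS installed).
    """
    if submit is None:
        def submit(path):
            # qsub prints the new job ID on stdout
            result = subprocess.run(
                ["qsub", path], capture_output=True, text=True, check=True
            )
            return result.stdout.strip()

    interval = 1.0 / max_per_second
    job_ids = []
    next_allowed = time.monotonic()
    for path in scripts:
        now = time.monotonic()
        if now < next_allowed:
            # pace submissions instead of hammering pbs_server
            time.sleep(next_allowed - now)
        job_ids.append(submit(path))
        next_allowed = time.monotonic() + interval
    return job_ids
```

Even a modest pacing like this tends to smooth out bursts that a single-threaded server process struggles with, at the cost of longer total submission time.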
My apologies if this is addressed somewhere in the documentation; I couldn't locate anything relevant to this issue.
Can you share some logs or more information about the kinds of jobs being submitted?
PBS should be able to stat jobs pretty fast. Which version of PBS are you running?
Some bugs in this area were fixed on master recently; perhaps you could give it another shot if you were testing on master.
We are running into what seems like similar behavior to the original poster's. We have an automated test system that on occasion will submit on the order of 40k+ jobs in very short order. Job history typically sits at around 125K jobs.
When this rush of jobs gets queued, qstat/qsub hang and sometimes the following error gets thrown:
We are running 19.1.1. Subhasis mentioned some recent bug fixes in this area. Any chance you can point me to them? I presume they are rolled up into the 20.0 release?
I'm not sure which particular bug fixes address qsub/qstat hanging, but the latest master has been tested to be responsive even with a million-plus jobs in the system, so I encourage you to try out master if possible.
Hi Caldoge,
Can you please share the patched RPM with the fix?
We are running into the same issue, and it would be really nice to have the fix as well.
We did not patch the RPM or upgrade PBS Pro. The answer was to disable all Bright monitoring of PBS and to quadruple the memory in the head node.
The cluster is managed with Bright Cluster Manager. Bright has the ability to track jobs so users can check on them in a convenient web portal. This is a Bad Idea when users are running tens of thousands of jobs a day: Bright was running `qstat -x -Fjson` every couple of minutes, generating a JSON result larger than 1 gigabyte.
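For anyone who can't disable Bright's PBS monitoring outright, another lever is shrinking the finished-job history that `qstat -x` has to serialize. A sketch using the standard PBS server attributes (the 48-hour duration is illustrative, not a recommendation; check what your site actually needs before changing it):

```
# Keep finished jobs for 48 hours instead of the default retention
qmgr -c "set server job_history_duration = 48:00:00"

# Or, if nothing at the site relies on job history, turn it off entirely
qmgr -c "set server job_history_enable = False"
```

With ~125K historical jobs in the system, every full `qstat -x -Fjson` poll pays for all of them, so trimming the history directly reduces both the payload size and the time pbs_server spends building it.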