Qsub/qstat slow (or failing) with thousands of jobs submitted

We have a cluster where a user submits many jobs (sometimes hundreds or more) via a Python script. Frequently, while his script is running, job submission fails, or jobs are submitted but then end up in an error state. qstat also fails at times when so many jobs are submitted at once.
The head node is a Dell PowerEdge R630 with 64 GB of RAM and 16 physical CPU cores (discounting hyperthreading). It’s running CentOS 7.7.1908, and the OpenPBS version is 19.1.3-100340.
Based on sar output, the head node is not overloaded in terms of CPU, RAM, or disk I/O.
The only component that looks overloaded is pbs_server.bin, which sometimes climbs to 100% CPU but never beyond it, suggesting it’s single-threaded.
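(In case it helps anyone reproducing this, here is a hedged sketch of how per-process CPU can be watched with pidstat, which ships in the same sysstat package as sar; the 5-second interval is just an example.)

  # Sample pbs_server CPU usage every 5 seconds; a single process pinned
  # near 100% on a multi-core box is consistent with a single-threaded bottleneck.
  pidstat -u -p "$(pgrep -d, -f pbs_server)" 5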

Are there any configuration changes we need to make to allow thousands of jobs to be submitted quickly? Or do we need to get the user to slow down the rate of job submission?
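(Purely to illustrate the second option, not a recommendation: a minimal shell sketch of client-side throttling. The real submitter here is a Python script, and the job_*.sh names and the 0.2-second pause below are made-up placeholders.)

  # Crude rate limit: pause briefly between qsub calls so pbs_server
  # isn't hit with thousands of submissions back to back.
  for script in job_*.sh; do
      qsub "$script"
      sleep 0.2
  done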

My apologies if this is addressed somewhere in documentation. I couldn’t locate anything relevant to this issue.

Can you share some logs or more information about the kind of jobs being submitted?
PBS should be able to stat jobs pretty fast. Which version of PBS are you running?

Some bugs on master were fixed recently, so perhaps you could give it another shot if you were testing against master.

Thanks for your reply. It turns out not to have been an issue with PBS. It was the combination of two things:

  1. The user runs thousands of jobs every day, so there were around 260,000 jobs in history.
  2. The cluster manager (Bright) was running “qstat -x -Fjson” EVERY TWO MINUTES, generating about 1.1 GB of output and overloading pbs_server.bin.

Bright provided us with patched RPMs, which fixed the problem, and qstat has been working normally ever since.
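(For anyone else landing here with a huge job history: a hedged sketch of how retention can be checked and shortened with qmgr. job_history_duration is a standard PBS server attribute, but the 72-hour value below is only an example, and what is appropriate depends on your site.)

  # Show any non-default job history settings.
  qmgr -c "print server" | grep job_history
  # Keep finished jobs for 72 hours instead of the much longer default
  # (requires PBS manager privileges, e.g. run as root on the server host).
  qmgr -c "set server job_history_duration = 72:00:00"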


We are running into what seems like behavior similar to the original poster’s. We have an automated test system that will occasionally submit on the order of 40,000+ jobs in a very short time. Job history typically sits at around 125K jobs.

When this rush of jobs gets queued, qstat/qsub hang, and sometimes the following error is thrown:

pbs_iff: error returned: 15019
pbs_iff: Invalid credential
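(A rough, hedged way to gauge whether accumulated job history is a factor here; qstat -x includes finished jobs, and the count and timing below are only approximate.)

  # Approximate number of jobs (including finished ones) the server is tracking.
  qstat -x | wc -l
  # How long a full historical query takes end to end.
  time qstat -x > /dev/null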

We are running 19.1.1. Subhasis mentioned some recent bug fixes in this area. Any chance you can point me to them? I presume they are rolled up into the 20.0 release?

Thanks!

I’m not sure which particular bug fixes address qsub/qstat hanging, but the latest master has been tested to remain responsive even with a million-plus jobs in the system, so I encourage you to try out master if possible.

Hi @caldodge,
Can you please share the patched RPM with the fix?
We are running into the same issue, and it would be really nice if we could have the fix as well.

Thanks,
Roy

We did not patch the RPM or upgrade PBS Pro. The answer was to disable all Bright monitoring of PBS and also to quadruple the memory in the head node.

Thanks @caldodge, what is Bright monitoring?

The cluster is managed with Bright Cluster Manager. Bright has the ability to track jobs, so users can check on them in a convenient web portal. This is a Bad Idea when users are running tens of thousands of jobs a day (Bright was running “qstat -x -Fjson” every couple of minutes, generating a JSON result larger than 1 gigabyte).
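(For anyone wanting to estimate the same load on their own cluster, a hedged one-liner using the exact command quoted above; wc -c simply reports the size of the JSON output in bytes.)

  # How big and how slow is the query Bright was issuing every couple of minutes?
  time qstat -x -Fjson | wc -c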

This is interesting. Can you share how to access the web page and how to disable it?

It has to be disabled in Bright configuration, rather than in a web page. I will see if I can dig up what we did to achieve that result.