We have a cluster where a user submits many jobs (sometimes hundreds or more) via a Python script. Frequently while this script is running, job submission fails, or jobs are submitted but end up in an error state. At times qstat also fails when that many jobs are submitted at once.
The head node is a Dell PowerEdge R630 with 64 GB of RAM and 16 physical CPU cores (not counting hyperthreading). It’s running CentOS 7.7.1908, and the OpenPBS version is 19.1.3-100340.
Based on sar output, the head node is not overloaded in terms of CPU, RAM, or disk I/O.
The only possibly overloaded component I can identify is pbs_server.bin, which sometimes reaches 100% CPU (one full core) but never goes above that, suggesting it’s single-threaded.
Are there any configuration changes we need to make to allow thousands of jobs to be submitted quickly? Or do we need to get the user to slow down the rate of job submission?
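If slowing down is the answer, here is a rough sketch of what I imagine the user's script could do: submit one job at a time with a short pause, and retry a failed qsub with exponential backoff. The function names, the pause length, and the retry counts are all made up for illustration and not taken from the actual script; the only real command assumed is a plain `qsub <script>`.

```python
import subprocess
import time


def submit_with_backoff(script_path, run=subprocess.run, sleep=time.sleep,
                        max_attempts=5, base_delay=1.0):
    """Submit one job via qsub, retrying with exponential backoff.

    Returns the job ID that qsub prints on success. Raises RuntimeError
    if every attempt fails. All defaults are illustrative guesses.
    """
    for attempt in range(max_attempts):
        result = run(["qsub", script_path],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     universal_newlines=True)
        if result.returncode == 0:
            return result.stdout.strip()
        if attempt < max_attempts - 1:
            # Wait 1s, 2s, 4s, ... so the server has time to catch up.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        "qsub failed %d times for %s" % (max_attempts, script_path))


def submit_all(script_paths, pause=0.2, run=subprocess.run, sleep=time.sleep):
    """Submit jobs sequentially with a fixed pause after each submission."""
    job_ids = []
    for path in script_paths:
        job_ids.append(submit_with_backoff(path, run=run, sleep=sleep))
        sleep(pause)  # hypothetical throttle; tune to what the server tolerates
    return job_ids
```

Calling `submit_all(["job1.sh", "job2.sh"])` would return the list of job IDs in submission order. If the jobs differ only by an index, a single job array submission (`qsub -J`) might reduce the number of qsub calls even further, though I have not tested whether that avoids the failures described here.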
My apologies if this is addressed somewhere in the documentation. I couldn’t locate anything relevant to this issue.