When submitting a job that depends on ~260 other jobs (e.g. qsub -W depend=afterany:<id1>:...:<id260> script.sh), everything works as expected. Once there are ~275 or more job IDs in the dependency list, the server hangs and commands such as qstat fail with this error:
Cannot connect to PBS server; Unknown error 15010
While stuck in this state, the qsub process can be found (via ps) still waiting on the host it was submitted from, with no corresponding job appearing in the PBS logs.
Also, the command ss -ntl '( sport = :15001 )' shows the following:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 257 256 0.0.0.0:15001 0.0.0.0:*
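The ss output above is the telling symptom: for a LISTEN socket, Recv-Q reports the current depth of the accept queue and Send-Q reports its configured limit. A minimal check (values hard-coded from the output above, not read live from ss) confirms the backlog is saturated:

```shell
#!/bin/bash
# Values taken from the ss output above; for LISTEN sockets, Recv-Q is the
# current accept-queue depth and Send-Q is the configured backlog limit.
RECV_Q=257
BACKLOG=256
if [ "$RECV_Q" -gt "$BACKLOG" ]; then
    echo "accept queue on :15001 is full; new connections will stall"
fi
```

Once the accept queue is full, new TCP connections to port 15001 (including those from qstat and qsub) sit unanswered, which matches the "Cannot connect to PBS server" symptom.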
After anywhere from several minutes to several hours, the port clears up, the server responds to requests again, and the job is successfully scheduled. Restarting the server also clears the condition.
OK, here is the sequence of actions that leads up to this issue:
User clicks "submit" in a 3rd-party web UI.
The user's initial job is submitted from a submit host to the server.
This initial job executes a qlogin from the cluster headnode onto one of the cluster compute nodes.
From whichever compute node accepts this qlogin, a batch of small jobs is submitted.
After all of the small jobs have been submitted, the same compute node tries to submit a final job that lists all of the previous small job IDs as dependencies.
This final step is where the issue occurs, but only when the final job depends on more than ~270 previous smaller jobs. When it depends on fewer than ~260 previous jobs, everything works smoothly with no hiccups.
Here are the log samples. I removed everything before 16:45:00; the initial job was launched after 16:48:22.
PBS_Server Logs
PBS_Sched Logs
PBS_Mom Logs
11/18/2025 13:57:34;0080;pbs_mom;Job;4438618[1083].headnode;task 00000001 terminated
11/18/2025 13:57:34;0008;pbs_mom;Job;4438618[1083].headnode;Terminated
11/18/2025 13:57:34;0100;pbs_mom;Job;4438618[1083].headnode;task 00000001 cput=00:24:46
11/18/2025 13:57:34;0008;pbs_mom;Job;4438618[1083].headnode;kill_job
11/18/2025 13:57:34;0100;pbs_mom;Job;4438618[1083].headnode;c4n5 cput=00:24:46 mem=127448kb
11/18/2025 13:57:34;0100;pbs_mom;Job;4438618[1083].headnode;Obit sent
11/18/2025 13:57:35;0100;pbs_mom;Req;;Type 54 request received from root@10.152.1.1:15001, sock=0
11/18/2025 13:57:35;0080;pbs_mom;Job;4438618[1083].headnode;copy file request received
11/18/2025 13:57:35;0100;pbs_mom;Job;4438618[1083].headnode;Staged 2/2 items out over 0:00:00
11/18/2025 13:57:35;0008;pbs_mom;Job;4438618[1083].headnode;no active tasks
11/18/2025 13:57:35;0100;pbs_mom;Req;;Type 6 request received from root@10.152.1.1:15001, sock=0
11/18/2025 13:57:35;0080;pbs_mom;Job;4438618[1083].headnode;delete job request received
11/18/2025 13:57:35;0008;pbs_mom;Job;4438618[1083].headnode;kill_job
11/18/2025 16:48:39;0100;pbs_mom;Req;;Type 1 request received from root@10.152.1.1:15001, sock=0
11/18/2025 16:48:39;0100;pbs_mom;Req;;Type 3 request received from root@10.152.1.1:15001, sock=0
11/18/2025 16:48:39;0100;pbs_mom;Req;;Type 5 request received from root@10.152.1.1:15001, sock=0
11/18/2025 16:48:39;0008;pbs_mom;Job;4438619.headnode;Started, pid = 1383337
11/18/2025 16:52:45;0080;pbs_mom;Job;4438619.headnode;task 00000001 terminated
11/18/2025 16:52:45;0008;pbs_mom;Job;4438619.headnode;Terminated
11/18/2025 16:52:45;0100;pbs_mom;Job;4438619.headnode;task 00000001 cput=00:00:02
11/18/2025 16:52:45;0008;pbs_mom;Job;4438619.headnode;kill_job
11/18/2025 16:52:45;0100;pbs_mom;Job;4438619.headnode;c4n5 cput=00:00:02 mem=50416kb
11/18/2025 16:52:45;0100;pbs_mom;Job;4438619.headnode;Obit sent
11/18/2025 16:53:47;0100;pbs_mom;Job;4438619.headnode;Obit sent
11/18/2025 16:54:51;0100;pbs_mom;Job;4438619.headnode;Obit sent
11/18/2025 16:56:02;0100;pbs_mom;Job;4438619.headnode;Obit sent
In this particular instance, the initial job was submitted from the compute node "c4n5". While the logs were stalled, I could observe the qsub process still running on c4n5 in the output of "ps -aef | grep qsub".
Thank you @cszczepa for sharing the above information.
It is recommended to submit jobs from the login/client nodes (which have only the pbs commands) or from the PBS Server host.
It is not recommended to submit jobs from compute (execution) nodes.
There is no limit on the number of dependent jobs that can be included with qsub; however, the maximum allowed command-line length is 4095 characters, which may restrict how many dependencies you can specify.
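As a rough sanity check against that 4095-character limit, the length of the afterany list can be estimated from the job-ID length. The 17-character ID length below is an assumption based on IDs like 4438618.headnode seen in the logs above:

```shell
#!/bin/bash
# Hypothetical back-of-the-envelope check: each dependency contributes
# one job ID plus a ":" separator to the afterany list.
JOBID_LEN=17   # e.g. "4438618.headnode"
N=275
DEP_LEN=$(( N * (JOBID_LEN + 1) ))
echo "afterany list for $N jobs: ~$DEP_LEN characters"
```

With IDs of that length, 275 dependencies produce a list of roughly 4950 characters, already past the 4095-character limit, so the exact threshold will depend on how long the job IDs are on a given cluster.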
I have tested by submitting 276 jobs plus 1 dependent job, and it worked fine for me. Could you please try submitting on the PBS Server host instead of the compute node, to see whether it works?
cat testopenpbs.sh
#!/bin/bash
/bin/sleep 300
cat submit_dependeny_script.sh
#!/bin/bash
JOB_SCRIPT="testopenpbs.sh"
COUNT=275
JOBIDS=()

echo "Now submitting $COUNT jobs..."
for i in $(seq 1 $COUNT); do
    jid=$(qsub "$JOB_SCRIPT")
    if [ -z "$jid" ]; then
        echo "Error: qsub failed on job $i"
        exit 1
    fi
    echo "Submitted job $i -> $jid"
    JOBIDS+=("$jid")
done

# Join the collected job IDs with ":" separators
DEPEND_LIST=$(printf ":%s" "${JOBIDS[@]}")
DEPEND_LIST=${DEPEND_LIST:1}

echo "============="
echo "Now the dependent job..."
FINAL_JID=$(qsub -W depend="afterany:$DEPEND_LIST" "$JOB_SCRIPT")
echo "Dependent job submitted -> $FINAL_JID"
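The printf join used in the script can be checked in isolation; the IDs below are made up for illustration:

```shell
#!/bin/bash
# Demonstrates the join idiom from the script above: printf prefixes each
# array element with ":", then the leading ":" is stripped off.
IDS=(101.headnode 102.headnode 103.headnode)
LIST=$(printf ":%s" "${IDS[@]}")
LIST=${LIST:1}
echo "$LIST"   # -> 101.headnode:102.headnode:103.headnode
```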
As a standard user on the OpenPBS server:
save the above files in a folder
chmod +x *.sh
source submit_dependeny_script.sh
qstat -fx <final job id>