Server hangs when submitting job using "afterany" dependency

Hello,

When submitting a job that depends on ~260 other jobs (qsub -Wdepend=afterany:1:260 script.sh), everything works as expected. Once there are ~275 or more job IDs in the argument list, the server hangs and commands such as qstat respond with this error:

Cannot connect to PBS server; Unknown error 15010

While stuck in this state, the qsub process can be found waiting on the host it was submitted from (visible via ps), with no corresponding job recorded in the PBS logs.

Also, the command "ss -ntl '( sport = :15001 )'" shows the following:

State                  Recv-Q                 Send-Q                                 Local Address:Port                                  Peer Address:Port
LISTEN                 257                    256                                          0.0.0.0:15001                                      0.0.0.0:*
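That ss output shows the accept queue on port 15001 is full (Recv-Q 257 against a backlog of 256), which would explain why clients hang rather than fail fast. Below is a minimal Python sketch, not PBS code, just a generic TCP illustration of the same mechanism: when a listener never drains its accept queue, new clients stall in connect() exactly the way qstat and qsub stall here.

```python
import socket

# Generic TCP illustration (not PBS code): when a listener's accept queue
# is full, new clients stall in connect() instead of failing fast -- the
# same symptom qstat/qsub showed against port 15001 above.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)                      # tiny backlog; pbs_server's was 256
port = srv.getsockname()[1]

clients, timed_out = [], 0
for _ in range(6):                 # more connects than the queue can hold
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(0.5)
    try:
        c.connect(("127.0.0.1", port))  # server never calls accept()
    except socket.timeout:
        timed_out += 1             # SYN dropped by the kernel; client hangs
    clients.append(c)

for c in clients:
    c.close()
srv.close()
print("stalled connects:", timed_out)
```

On Linux, the first couple of connects complete (they sit in the accept queue) and the rest time out; with no client timeout set, they would hang indefinitely, which matches the "several minutes to several hours" behavior you describe once the backlog finally drains.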

After waiting anywhere from several minutes to several hours, the port clears up, the server responds to requests again, and the job is successfully scheduled. Restarting the server also clears everything up.

This issue is reproducible under Oracle Linux 8.10, running the most recent version of OpenPBS built from source. Here is the specific commit as of this writing: openpbs/openpbs at 30c29c22b52ca7aafc398d15a3702cc52b39fcf8

Is this a bug? I can provide more information as needed to help diagnose it.

Could you please integrate MUNGE authentication and find out whether this solves the problem?

Please refer to section 11.4.4, Authentication via MUNGE.

Hello, I have integrated MUNGE authentication in my setup, but the original issue remains, with the same behavior as before.

What are your thoughts?

Could you please share the steps to reproduce this issue, including the exact sequence of actions that led to it?

Additionally, please:

  • Increase the log level on the server, scheduler, and MoM.
  • Note down the timestamp before reproducing the issue.
  • Reproduce the issue and share the PBS daemon logs from that timestamp up to the point where the issue occurs.
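For reference, one possible way to raise the verbosity, a sketch based on the standard OpenPBS attributes and file locations (adjust PBS_HOME, here assumed to be /var/spool/pbs, for your install):

```shell
# Raise server and scheduler log verbosity (2047 enables all event classes)
qmgr -c "set server log_events = 2047"
qmgr -c "set sched log_events = 2047"

# On each execution host, raise MoM verbosity and make it re-read its config
echo '$logevent 0xffffffff' >> /var/spool/pbs/mom_priv/config
kill -HUP "$(cat /var/spool/pbs/mom_priv/mom.lock)"
```

Remember to dial these back down afterwards, as full event logging is chatty.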

OK, here is the sequence of actions that leads up to this issue:

  1. A user clicks "submit" on a third-party web UI.
  2. The user's initial job is submitted from a submit host to the server.
  3. This initial job executes a qlogin from the cluster headnode onto one of the cluster compute nodes.
  4. From whichever compute node accepts this qlogin, a batch of small jobs is submitted.
  5. After all of the small jobs have been submitted, the same compute node tries to submit a final job that lists all of the previous small job IDs as dependencies.

This final step is where the issue occurs, but only when the final job depends on more than ~270 of the previous smaller jobs. With fewer than ~260 previous jobs, everything works smoothly with no hiccups.


Here are the log samples. I removed everything before 16:45:00 and launched the initial job after 16:48:22.

PBS_Server Logs

PBS_Sched Logs

PBS_Mom Logs

11/18/2025 13:57:34;0080;pbs_mom;Job;4438618[1083].headnode;task 00000001 terminated
11/18/2025 13:57:34;0008;pbs_mom;Job;4438618[1083].headnode;Terminated
11/18/2025 13:57:34;0100;pbs_mom;Job;4438618[1083].headnode;task 00000001 cput=00:24:46
11/18/2025 13:57:34;0008;pbs_mom;Job;4438618[1083].headnode;kill_job
11/18/2025 13:57:34;0100;pbs_mom;Job;4438618[1083].headnode;c4n5 cput=00:24:46 mem=127448kb
11/18/2025 13:57:34;0100;pbs_mom;Job;4438618[1083].headnode;Obit sent
11/18/2025 13:57:35;0100;pbs_mom;Req;;Type 54 request received from root@10.152.1.1:15001, sock=0
11/18/2025 13:57:35;0080;pbs_mom;Job;4438618[1083].headnode;copy file request received
11/18/2025 13:57:35;0100;pbs_mom;Job;4438618[1083].headnode;Staged 2/2 items out over 0:00:00
11/18/2025 13:57:35;0008;pbs_mom;Job;4438618[1083].headnode;no active tasks
11/18/2025 13:57:35;0100;pbs_mom;Req;;Type 6 request received from root@10.152.1.1:15001, sock=0
11/18/2025 13:57:35;0080;pbs_mom;Job;4438618[1083].headnode;delete job request received
11/18/2025 13:57:35;0008;pbs_mom;Job;4438618[1083].headnode;kill_job
11/18/2025 16:48:39;0100;pbs_mom;Req;;Type 1 request received from root@10.152.1.1:15001, sock=0
11/18/2025 16:48:39;0100;pbs_mom;Req;;Type 3 request received from root@10.152.1.1:15001, sock=0
11/18/2025 16:48:39;0100;pbs_mom;Req;;Type 5 request received from root@10.152.1.1:15001, sock=0
11/18/2025 16:48:39;0008;pbs_mom;Job;4438619.headnode;Started, pid = 1383337
11/18/2025 16:52:45;0080;pbs_mom;Job;4438619.headnode;task 00000001 terminated
11/18/2025 16:52:45;0008;pbs_mom;Job;4438619.headnode;Terminated
11/18/2025 16:52:45;0100;pbs_mom;Job;4438619.headnode;task 00000001 cput=00:00:02
11/18/2025 16:52:45;0008;pbs_mom;Job;4438619.headnode;kill_job
11/18/2025 16:52:45;0100;pbs_mom;Job;4438619.headnode;c4n5 cput=00:00:02 mem=50416kb
11/18/2025 16:52:45;0100;pbs_mom;Job;4438619.headnode;Obit sent
11/18/2025 16:53:47;0100;pbs_mom;Job;4438619.headnode;Obit sent
11/18/2025 16:54:51;0100;pbs_mom;Job;4438619.headnode;Obit sent
11/18/2025 16:56:02;0100;pbs_mom;Job;4438619.headnode;Obit sent

In this particular instance, the initial job was submitted from the compute node "c4n5". While all of the logs stalled, I am able to observe this process running on c4n5 in the output of "ps -aef | grep qsub":

userA 1383892 1383555  0 16:49 ?        00:00:00 /opt/pbs/bin/qsub -W depend afterany:4438620:4438621:4438622:4438623:4438624:4438625:4438626:4438627:4438628:4438629:4438630:4438631:4438632:4438633:4438634:4438635:4438636:4438637:4438638:4438639:4438640:4438641:4438642:4438643:4438644:4438645:4438646:4438647:4438648:4438649:4438650:4438651:4438652:4438653:4438654:4438655:4438656:4438657:4438658:4438659:4438660:4438661:4438662:4438663:4438664:4438665:4438666:4438667:4438668:4438669:4438670:4438671:4438672:4438673:4438674:4438675:4438676:4438677:4438678:4438679:4438680:4438681:4438682:4438683:4438684:4438685:4438686:4438687:4438688:4438689:4438690:4438691:4438692:4438693:4438694:4438695:4438696:4438697:4438698:4438699:4438700:4438701:4438702:4438703:4438704:4438705:4438706:4438707:4438708:4438709:4438710:4438711:4438712:4438713:4438714:4438715:4438716:4438717:4438718:4438719:4438720:4438721:4438722:4438723:4438724:4438725:4438726:4438727:4438728:4438729:4438730:4438731:4438732:4438733:4438734:4438735:4438736:4438737:4438738:4438739:4438740:4438741:4438742:4438743:4438744:4438745:4438746:4438747:4438748:4438749:4438750:4438751:4438752:4438753:4438754:4438755:4438756:4438757:4438758:4438759:4438760:4438761:4438762:4438763:4438764:4438765:4438766:4438767:4438768:4438769:4438770:4438771:4438772:4438773:4438774:4438775:4438776:4438777:4438778:4438779:4438780:4438781:4438782:4438783:4438784:4438785:4438786:4438787:4438788:4438789:4438790:4438791:4438792:4438793:4438794:4438795:4438796:4438797:4438798:4438799:4438800:4438801:4438802:4438803:4438804:4438805:4438806:4438807:4438808:4438809:4438810:4438811:4438812:4438813:4438814:4438815:4438816:4438817:4438818:4438819:4438820:4438821:4438822:4438823:4438824:4438825:4438826:4438827:4438828:4438829:4438830:4438831:4438832:4438833:4438834:4438835:4438836:4438837:4438838:4438839:4438840:4438841:4438842:4438843:4438844:4438845:4438846:4438847:4438848:4438849:4438850:4438851:4438852:4438853:4438854:4438855:4438856:4438857:4438858:44
38859:4438860:4438861:4438862:4438863:4438864:4438865:4438866:4438867:4438868:4438869:4438870:4438871:4438872:4438873:4438874:4438875:4438876:4438877:4438878:4438879:4438880:4438881:4438882:4438883:4438884:4438885:4438886:4438887:4438888:4438889:4438890:4438891:4438892:4438893:4438894 project_post_traj.sh

The logs remain stuck like this and the process remains stuck running on c4n5.

Thank you @cszczepa for sharing the above information.

  • It is recommended to submit jobs from login/client nodes (which have only the PBS commands installed) or from the PBS server host.

  • It is not recommended to submit jobs from compute (execution) nodes.

  • There is no limit on the number of dependent jobs that can be passed to qsub; however, the maximum allowed command-line length is 4095 characters, which may restrict how many dependencies you can specify.

  • I tested by submitting 276 jobs plus 1 dependent job, and it worked fine for me. Could you please try submitting on the PBS server host rather than the compute node, to see whether it works?
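As a rough sanity check on the 4095-character point, here is a back-of-the-envelope calculation under my own assumptions: 7-digit numeric job IDs like those visible in the ps output, and internal qualification of each ID with the server name ("4438620.headnode"), which I have not verified against the OpenPBS source.

```python
# Back-of-the-envelope check of the 4095-character limit mentioned above.
# Assumptions (mine, not verified): 7-digit numeric job IDs as seen in the
# ps output, and that each ID may be qualified internally as "NNNNNNN.headnode".
n_jobs = 275
as_typed  = len("afterany" + ":4438620" * n_jobs)           # IDs as submitted
qualified = len("afterany" + ":4438620.headnode" * n_jobs)  # server-qualified
print(as_typed, qualified)   # 2208 4683 -- only the qualified form exceeds 4095
```

If the server does qualify each ID internally, roughly 240 dependencies of this shape would already cross 4095 characters, which is at least in the neighborhood of the observed ~270 threshold; if it does not, the string as typed stays well under the limit at 275 dependencies.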

cat testopenpbs.sh
#!/bin/bash
/bin/sleep 300

cat submit_dependeny_script.sh

#!/bin/bash
JOB_SCRIPT="testopenpbs.sh"
COUNT=275
JOBIDS=()
echo "Now submitting $COUNT jobs..."
for i in $(seq 1 $COUNT); do
    jid=$(qsub "$JOB_SCRIPT")
    if [ -z "$jid" ]; then
        echo "Error: qsub failed on job $i"
        exit 1
    fi
    echo "Submitted job $i -> $jid"
    JOBIDS+=("$jid")
done
DEPEND_LIST=$(printf ":%s" "${JOBIDS[@]}")
DEPEND_LIST=${DEPEND_LIST:1}   # strip the leading ':'
echo "============="
echo "Now the dependent job..."
FINAL_JID=$(qsub -W depend=afterany:$DEPEND_LIST "$JOB_SCRIPT")
echo "Dependent job submitted -> $FINAL_JID"


As a standard user on the OpenPBS server host:

save the above files in a folder
chmod +x *.sh
source submit_dependeny_script.sh
qstat -fx <final job id>