Sporadic DIS errors in Server_Log

Seeing some sporadic DIS errors when -W block=True is requested:

./20250103:01/03/2025 14:19:32;0001;Server@sched;Svr;Server@sched;check_block_wt, DIS error while replying to client gitrun3 for job 164004.sched

The job exits cleanly (status 0), but the submission terminal hangs because it never receives a response from the Scheduler.

I see this in the MOM logs:
01/28/2025 11:02:39;0080;pbs_mom;Job;179807.sched;task 00000001 terminated
01/28/2025 11:02:39;0800;pbs_mom;n/a;mom_get_sample;nprocs: 615, cantstat: 2, nomem: 0, skipped: 492, cached: 0
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;Terminated
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;task 00000001 cput=00:13:00
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;kill_job
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;node9 cput=00:13:00 mem=1275332kb
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;Obit sent
01/28/2025 11:02:39;0100;pbs_mom;Req;;Type 54 request received from root@10.10.38.18:15001, sock=5
01/28/2025 11:02:39;0080;pbs_mom;Job;179807.sched;copy file request received
01/28/2025 11:02:39;0800;pbs_mom;Job;stage_file;Skipping directly written/absent spool file /var/spool/pbs/spool/179807.sched.OU
01/28/2025 11:02:39;0800;pbs_mom;Job;stage_file;Skipping directly written/absent spool file /var/spool/pbs/spool/179807.sched.ER
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;staged 2 items out over 0:00:00
01/28/2025 11:02:39;0800;pbs_mom;n/a;mom_get_sample;nprocs: 616, cantstat: 2, nomem: 0, skipped: 492, cached: 0
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;no active tasks
01/28/2025 11:02:39;0100;pbs_mom;Req;;Type 6 request received from root@10.10.38.18:15001, sock=5
01/28/2025 11:02:39;0080;pbs_mom;Job;179807.sched;delete job request received
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;kill_job

This reminds me of a problem that used to exist if Security ran a network port scan against the qsub host while qsub was blocked. In testing just now, I was able to reproduce the qsub hang, but not the error message on the server.

Can you check logs to see if a port scan was running at the times of the DIS errors?
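
To make it easier to line the DIS errors up against any scan schedule, here is a rough sketch that pulls the timestamp, client, and job ID out of server log lines. It assumes the semicolon-separated layout of the server log line quoted at the top of the thread (including an optional grep file-name prefix); the function name `dis_error_times` is made up for illustration.

```python
import re
from datetime import datetime

# Matches the tail of the message seen in the quoted server log line.
CLIENT_JOB = re.compile(r"client (\S+) for job (\S+)")

def dis_error_times(lines):
    """Return (timestamp, client, job_id) for each 'DIS error' line."""
    hits = []
    for line in lines:
        if "DIS error" not in line:
            continue
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # not in the expected six-field layout
        # grep output may prefix the file name ("./20250103:"), so keep
        # only the trailing "MM/DD/YYYY HH:MM:SS" part of the first field.
        try:
            when = datetime.strptime(fields[0][-19:], "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue
        match = CLIENT_JOB.search(fields[5])
        client, job = match.groups() if match else (None, None)
        hits.append((when, client, job))
    return hits
```

Feeding it the output of a grep for "DIS error" over the server_logs directory should give a list of times to compare against whatever the network team can provide.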

Unfortunately I can’t find any evidence of a port scan in the logs. I did notice server_logs complaining about a low file limit (1024), and I increased that to 65536, which seems to have resolved the issue.
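
For anyone wanting to confirm which open-file limit a running daemon actually has, a minimal sketch, assuming a Linux host where `/proc/<pid>/limits` is available. A running process keeps the limits it started with unless they are changed explicitly, so this is worth checking after a restart. The function names are hypothetical, and the PID has to be looked up separately.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits, or None.

    Values are returned as strings because either one may be "unlimited".
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            parts = line.split()
            # parts: ["Max", "open", "files", soft, hard, "files"]
            return parts[3], parts[4]
    return None

def max_open_files(pid):
    """Read the open-file limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_max_open_files(handle.read())
```

Calling `max_open_files(pid)` with the server's PID (usually needs root, or the same user as the daemon) returns the soft and hard limits currently in effect for that process.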

I tried to reproduce by running a repetitive netcat to the TCP ports used by qsub on the submission host, but wasn’t able to.
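
In case someone else wants to repeat the attempt, this is roughly what the repeated netcat amounts to, sketched in Python. It only opens and immediately closes TCP connections, much like `nc -z` in a loop. The host and port are placeholders: the port qsub is listening on has to be found first (for example with lsof), and `probe` is a made-up helper name.

```python
import socket
import time

def probe(host, port, attempts=10, delay=0.0, timeout=2.0):
    """Open and immediately close a TCP connection `attempts` times.

    Returns the number of attempts that connected successfully.
    """
    connected = 0
    for _ in range(attempts):
        try:
            # The context manager closes the socket right after connecting.
            with socket.create_connection((host, port), timeout=timeout):
                connected += 1
        except OSError:
            pass  # refused, timed out, or unreachable
        time.sleep(delay)
    return connected
```

Something like `probe("submit-host", 12345, attempts=100, delay=0.1)` while a blocked qsub is waiting would mimic the test, with both arguments replaced by real values from the site.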

Just in case: the pbs_comm daemon's open file limit depends on the limits set at the operating system level. That said, I have not seen this DIS error in that context. Most occurrences I have seen were related to 1) what @dtalcott mentioned, or 2) different versions of the OpenPBS client commands in the mix, or two incompatible versions of the daemons.

Our PBS environment is homogeneous at 20.0.1; we don’t have other versions installed. @dtalcott, would you mind sharing how you were able to reproduce the DIS error?

I was unable to reproduce the DIS error message, just the qsub hang, and that was by running nc against the port qsub was listening on.

Just out of curiosity, how did you get more than 1024 active connections to the pbs server? That seems wrong. Can you run an lsof on the server to see why so many file descriptors are tied up?
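
If lsof output is too noisy, a rough alternative on Linux is to tally what the server's descriptors point at by reading `/proc/<pid>/fd`. This is only a sketch: the helper names are hypothetical, it needs root or the daemon's own user, and the PID must be supplied by hand.

```python
import os
from collections import Counter

def classify(target):
    """Map a /proc/<pid>/fd link target to a coarse kind."""
    if target.startswith("/"):
        return "file"  # regular path on disk (log file, device, ...)
    # Non-path targets look like "socket:[12345]" or "pipe:[678]".
    return target.split(":", 1)[0]

def summarize(targets):
    """Count descriptors per kind."""
    return Counter(classify(t) for t in targets)

def fd_targets(pid):
    """List the link targets of every open descriptor of `pid` (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # descriptor closed between listdir and readlink
    return targets
```

`summarize(fd_targets(pid))` gives a quick count per kind. A large "socket" count would point at connections piling up, while a large "file" count would point elsewhere.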