Sporadic DIS errors in Server_Log

Seeing some sporadic DIS errors when -W block=True is requested:

./20250103:01/03/2025 14:19:32;0001;Server@sched;Svr;Server@sched;check_block_wt, DIS error while replying to client gitrun3 for job 164004.sched

The job exits cleanly (status 0), but the submission terminal hangs because it never received a response from the Scheduler.

I see this in the MOM logs:
01/28/2025 11:02:39;0080;pbs_mom;Job;179807.sched;task 00000001 terminated
01/28/2025 11:02:39;0800;pbs_mom;n/a;mom_get_sample;nprocs: 615, cantstat: 2, nomem: 0, skipped: 492, cached: 0
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;Terminated
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;task 00000001 cput=00:13:00
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;kill_job
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;node9 cput=00:13:00 mem=1275332kb
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;Obit sent
01/28/2025 11:02:39;0100;pbs_mom;Req;;Type 54 request received from root@10.10.38.18:15001, sock=5
01/28/2025 11:02:39;0080;pbs_mom;Job;179807.sched;copy file request received
01/28/2025 11:02:39;0800;pbs_mom;Job;stage_file;Skipping directly written/absent spool file /var/spool/pbs/spool/179807.sched.OU
01/28/2025 11:02:39;0800;pbs_mom;Job;stage_file;Skipping directly written/absent spool file /var/spool/pbs/spool/179807.sched.ER
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;staged 2 items out over 0:00:00
01/28/2025 11:02:39;0800;pbs_mom;n/a;mom_get_sample;nprocs: 616, cantstat: 2, nomem: 0, skipped: 492, cached: 0
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;no active tasks
01/28/2025 11:02:39;0100;pbs_mom;Req;;Type 6 request received from root@10.10.38.18:15001, sock=5
01/28/2025 11:02:39;0080;pbs_mom;Job;179807.sched;delete job request received
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;kill_job

This reminds me of a problem that used to exist if Security ran a network port scan against the qsub host while qsub was blocked. In testing just now, I was able to reproduce the qsub hang, but not the error message on the server.

Can you check logs to see if a port scan was running at the times of the DIS errors?
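
In case it helps with lining up the times, here is a rough Python sketch that pulls the timestamp, client host and job ID out of each DIS reply error. It assumes the semicolon-separated record layout seen in the server log line quoted earlier in this thread, and the name find_dis_errors is only illustrative, not part of PBS.

```python
import re

# Message text as it appears in the server log line quoted in this thread.
DIS_RE = re.compile(
    r"DIS error while replying to client (?P<client>\S+) for job (?P<job>\S+)"
)
# Timestamp at the start of a record (grep may prefix it with "filename:").
TS_RE = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}")


def find_dis_errors(lines):
    """Return a list of (timestamp, client, job_id) for each DIS reply error."""
    hits = []
    for line in lines:
        # Assumed layout: timestamp;event code;daemon;object type;object name;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue
        msg = DIS_RE.search(fields[5])
        ts = TS_RE.search(fields[0])
        if msg and ts:
            hits.append((ts.group(0), msg.group("client"), msg.group("job")))
    return hits
```

Feeding it the lines of a server log file (for example via open(path)) gives a list of times that can be compared against firewall or IDS records for the qsub host.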

Unfortunately, I can’t find any evidence of a port scan in the logs. I did notice the server logs complaining about a low open file limit (1024), so I increased it to 65536, which seems to have resolved the issue.
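
For anyone wanting to confirm which limit a running daemon actually got, a minimal sketch follows. It assumes a Linux host, where /proc/<pid>/limits lists the per-process limits; the function names are hypothetical and you supply the PID of the daemon yourself.

```python
from pathlib import Path

LABEL = "Max open files"


def parse_max_open_files(limits_text):
    """Return (soft, hard) as strings from the text of /proc/<pid>/limits, or None."""
    for line in limits_text.splitlines():
        if line.startswith(LABEL):
            # Remaining columns are: soft limit, hard limit, units
            soft, hard = line[len(LABEL):].split()[:2]
            return soft, hard
    return None


def daemon_open_file_limit(pid):
    """Read the effective open file limit of a running process (Linux only)."""
    return parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
```

The values come back as strings because the kernel may report "unlimited" instead of a number. A soft limit of 1024 here would match the situation described above.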

I tried to reproduce by running a repetitive netcat to the TCP ports used by qsub on the submission host, but wasn’t able to.
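
The repeated connect attempt can also be scripted, roughly like the sketch below. The host and port are placeholders you would have to fill in for your own blocked qsub (they are not taken from this thread), and poke_port is a made-up helper name. It only opens and closes plain TCP connections, so it approximates a simple connect scan rather than any particular scanner.

```python
import socket
import time


def poke_port(host, port, attempts=100, delay=0.05, timeout=1.0):
    """Open and immediately close a TCP connection `attempts` times.

    Returns how many of the attempts connected successfully.
    """
    connected = 0
    for _ in range(attempts):
        try:
            # The with-block closes the socket as soon as the connect succeeds.
            with socket.create_connection((host, port), timeout=timeout):
                connected += 1
        except OSError:
            # Refused, timed out, unreachable: count as a failed attempt.
            pass
        time.sleep(delay)
    return connected
```

Running it against the submission host while a qsub -W block=true is waiting, and then checking the server log for new DIS errors, would be one way to test the port scan theory.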

Just in case: the pbs_comm daemon’s open file limit depends on the limits set at the operating system level. That said, I have not seen this DIS error in that context; most of the cases I have seen were related to 1) what @dtalcott mentioned, or 2) different versions of the OpenPBS client commands in the mix, or two incompatible versions of the daemons.