Sporadic DIS errors in Server_Log

mgasp · January 30, 2025, 12:37am

Seeing some sporadic DIS errors when -W block=True is requested:

./20250103:01/03/2025 14:19:32;0001;Server@sched;Svr;Server@sched;check_block_wt, DIS error while replying to client gitrun3 for job 164004.sched

The job exits cleanly (status 0), but the submission terminal hangs because it never receives a response from the Scheduler.

I see this in the MOM logs:
01/28/2025 11:02:39;0080;pbs_mom;Job;179807.sched;task 00000001 terminated
01/28/2025 11:02:39;0800;pbs_mom;n/a;mom_get_sample;nprocs: 615, cantstat: 2, nomem: 0, skipped: 492, cached: 0
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;Terminated
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;task 00000001 cput=00:13:00
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;kill_job
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;node9 cput=00:13:00 mem=1275332kb
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;Obit sent
01/28/2025 11:02:39;0100;pbs_mom;Req;;Type 54 request received from root@10.10.38.18:15001, sock=5
01/28/2025 11:02:39;0080;pbs_mom;Job;179807.sched;copy file request received
01/28/2025 11:02:39;0800;pbs_mom;Job;stage_file;Skipping directly written/absent spool file /var/spool/pbs/spool/179807.sched.OU
01/28/2025 11:02:39;0800;pbs_mom;Job;stage_file;Skipping directly written/absent spool file /var/spool/pbs/spool/179807.sched.ER
01/28/2025 11:02:39;0100;pbs_mom;Job;179807.sched;staged 2 items out over 0:00:00
01/28/2025 11:02:39;0800;pbs_mom;n/a;mom_get_sample;nprocs: 616, cantstat: 2, nomem: 0, skipped: 492, cached: 0
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;no active tasks
01/28/2025 11:02:39;0100;pbs_mom;Req;;Type 6 request received from root@10.10.38.18:15001, sock=5
01/28/2025 11:02:39;0080;pbs_mom;Job;179807.sched;delete job request received
01/28/2025 11:02:39;0008;pbs_mom;Job;179807.sched;kill_job

dtalcott · January 31, 2025, 11:33pm

This reminds me of a problem that used to exist if Security ran a network port scan against the qsub host while qsub was blocked. In testing just now, I was able to reproduce the qsub hang, but not the error message on the server.

Can you check logs to see if a port scan was running at the times of the DIS errors?
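To line the DIS errors up against any scan activity, it helps to pull their timestamps out of the server logs first. Below is a minimal sketch that assumes the semicolon-separated layout of the single server log line quoted at the top of this thread; the function name `find_dis_errors` is made up for illustration and is not part of PBS.

```python
import re
from datetime import datetime

# Pattern derived from the one sample line in this thread (an assumption):
# timestamp;code;daemon;type;object;message
DIS_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});[^;]*;[^;]*;[^;]*;[^;]*;"
    r"check_block_wt, DIS error while replying to client (?P<client>\S+) "
    r"for job (?P<job>\S+)"
)

def find_dis_errors(lines):
    """Return (timestamp, client host, job id) for every matching DIS error line."""
    hits = []
    for line in lines:
        m = DIS_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%m/%d/%Y %H:%M:%S")
            hits.append((ts, m.group("client"), m.group("job")))
    return hits
```

Feeding it the lines of each server log file gives a list of times and submission hosts that can then be compared with firewall or IDS logs.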

mgasp · February 1, 2025, 3:31pm

Unfortunately I can’t find any evidence of a port scan in the logs. I did notice server_logs complaining about a low file limit (1024), so I increased it to 65536, which seems to have resolved the issue.
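For anyone wanting to confirm which limit a running daemon actually has, here is a small sketch that reads the "Max open files" row from the text of `/proc/<pid>/limits`. It assumes a Linux host; the helper name `parse_max_open_files` is hypothetical, and the PID has to be looked up separately.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row, as strings, or None."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None

def max_open_files_for_pid(pid):
    """Read the limits of a running process (Linux only, needs read permission)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_max_open_files(fh.read())
```

The values are returned as strings because the kernel may report `unlimited` instead of a number.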

I tried to reproduce by running a repetitive netcat to the TCP ports used by qsub on the submission host, but wasn’t able to.
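A rough Python equivalent of that kind of netcat loop is sketched below, in case it is useful to others trying the same thing. The host, port, attempt count, and the name `poke_port` are all placeholders; the port qsub listens on while blocked has to be found separately (for example with lsof).

```python
import socket
import time

def poke_port(host, port, attempts=5, delay=0.0, timeout=2.0):
    """Open and immediately close TCP connections; return how many connects succeeded."""
    ok = 0
    for _ in range(attempts):
        try:
            # Connect, send nothing, and close again, much like a bare port probe
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            pass  # refused, timed out, or unreachable
        time.sleep(delay)
    return ok
```

This only mimics a simple connect-and-close probe; a real security scanner may send other traffic, so a negative result here does not rule a scan out.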

adarsh · February 1, 2025, 7:51pm

Just in case: the pbs_comm daemon’s open file limit depends on the limits set at the operating system level. That said, I have not seen this particular DIS error; most of the DIS errors I have seen were related to 1) what @dtalcott mentioned, or 2) different versions of the OpenPBS client commands in the mix, or two incompatible versions of the daemons.

mgasp · February 3, 2025, 2:40pm

Our PBS environment is homogeneous at 20.0.1; we don’t have other versions installed. @dtalcott would you mind sharing how you were able to reproduce the DIS error?

dtalcott · February 5, 2025, 12:52am

I was unable to reproduce the DIS error message, just the qsub hang, and that was by running nc against the port qsub was listening on.

Just out of curiosity, how did you get more than 1024 active connections to the PBS server? That seems wrong. Can you run an lsof on the server to see why so many file descriptors are tied up?
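If lsof is not handy, a quick breakdown of what a process has open can also be had from `/proc/<pid>/fd`. The sketch below assumes Linux and enough privilege to read the target process's descriptors; the function names are illustrative only.

```python
import os
from collections import Counter

def classify_fd_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor kind."""
    for prefix, kind in (("socket:", "socket"), ("pipe:", "pipe"),
                         ("anon_inode:", "anon_inode")):
        if target.startswith(prefix):
            return kind
    return "file"

def summarize_fds(pid="self"):
    """Count the open descriptors of a process by kind (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were listing
        kinds[classify_fd_target(target)] += 1
    return kinds
```

A socket count close to the limit would point at connections piling up, while a large file count would suggest something else is holding descriptors.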
