SIGPIPE on interactive job

hi,

We are running into a strange bug when submitting interactive jobs.
A job that can start quickly, such as
qsub -X -I -l select=1:ncpus=1:mem=10gb
qsub: waiting for job 23913.pbs59 to start
qsub: job 23913.pbs59 ready

runs normally.
But when the job has to wait in the queue for a while,

qsub -X -I -l select=100:ncpus=72:mem=10gb
qsub: waiting for job 23917.pbs59 to start
qsub: SIGPIPE received, job submission interrupted.: Connection reset by peer

qsub exits with the message above, but the job stays queued and there is no error message anywhere. Is this a bug, or some system setting we are running into?

thanks
Thomas

My guess is that the host where you run qsub is being port-scanned. If the job starts up between scans, everything is okay. If a scan happens while qsub is waiting for the job to start, the scan confuses qsub.

As a test, submit the long-waiting job without the -X option. If you get an odd message while the qsub is waiting, but the job eventually starts okay, then some kind of scan is the likely culprit.
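
For reference, the test would look roughly like this: the same resource request as the failing submission above, with only -X (X11 forwarding) dropped. The QSUB_ARGS variable and the check for qsub on the PATH are just illustrative scaffolding, not something PBS requires.

```shell
# Same resource request as the failing job, with -X (X11 forwarding) removed.
# QSUB_ARGS is a hypothetical helper variable used only in this sketch.
QSUB_ARGS="-I -l select=100:ncpus=72:mem=10gb"

# Submit only if qsub is available on this host; otherwise just show the command.
if command -v qsub >/dev/null 2>&1; then
    qsub $QSUB_ARGS
else
    echo "would run: qsub $QSUB_ARGS"
fi
```

Leave the session waiting until the job starts (or fails), and note the time of any message that qsub prints in between.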

hi,

Thanks for your message, it put us on the right track. We just realized that we currently restart the scheduler every hour to get rid of zombie processes. That correlates pretty well with the timeout after 1 hour that our users see… We will patch our version and test it, reducing the restarts to once every 24 hours like we did with pbspro.

thanks
Thomas