We have some nodes that might be causing an issue. It seems that when a job that requires two nodes lands on, say, compute-0-4, the job times out. If the job runs entirely on compute-0-4, it completes successfully. I am trying to find another job that ran on compute-0-4 together with another node and completed successfully. Is there a way to find all jobs that have run on compute-0-4? Thanks.
The best way would be to look in your accounting logs. The following is not at all elegant, but something like this might help:
. /etc/pbs.conf ; grep "compute-0-4" $PBS_HOME/server_priv/accounting/201708* | grep ";E;" | grep "Exit_status=0" | awk -F';' '{print $3 " " $4}' | awk '{print $1 " " $12}' | grep '+'
That should give you a list of job IDs to start looking at (from August; adjust the “201708*” as needed, of course). It should print the job ID and the exec_vnode string for successful jobs that ran on more than one vnode, one of which was “compute-0-4”. It does not filter out multi-chunk jobs that had all chunks running on compute-0-4, and the list may not be perfect in other ways, so treat it as a starting point.
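If you want something a bit less fragile than counting awk fields, here is a rough sketch of the same idea as a shell function. It assumes the usual accounting record layout of `date time;E;jobid;key=value key=value ...` and that no attribute value contains spaces or semicolons. The name `jobs_on_node` is made up for illustration; it is not a PBS command. Unlike the one-liner, it matches the node name exactly (so compute-0-40 will not match compute-0-4) and skips jobs whose chunks all ran on the one node.

```shell
# Hypothetical helper, not part of PBS: print "jobid exec_vnode" for
# successful jobs that ran on NODE plus at least one other vnode.
# Usage: jobs_on_node NODE ACCOUNTING_FILE...
jobs_on_node() {
    node=$1
    shift
    awk -F';' -v node="$node" '
        # Only end-of-job ("E") records carry Exit_status.
        $2 == "E" {
            vnode = ""; status = ""
            # Field 4 is assumed to be space-separated key=value pairs.
            n = split($4, attrs, " ")
            for (i = 1; i <= n; i++) {
                if (index(attrs[i], "exec_vnode=") == 1)
                    vnode = substr(attrs[i], 12)
                else if (index(attrs[i], "Exit_status=") == 1)
                    status = substr(attrs[i], 13)
            }
            if (status != "0" || vnode == "") next

            # exec_vnode looks like (a:ncpus=8)+(b:ncpus=8); drop the
            # parentheses and split the chunks on "+".
            v = vnode
            gsub(/[()]/, "", v)
            split("", seen)          # portable way to empty the array
            hit = 0; distinct = 0
            m = split(v, parts, "[+]")
            for (i = 1; i <= m; i++) {
                name = parts[i]
                sub(/:.*/, "", name) # keep only the vnode name
                if (!(name in seen)) { seen[name] = 1; distinct++ }
                if (name == node) hit = 1
            }
            if (hit && distinct > 1) print $3, vnode
        }
    ' "$@"
}
```

You would call it along the lines of `. /etc/pbs.conf ; jobs_on_node compute-0-4 "$PBS_HOME"/server_priv/accounting/201708*`. As with the one-liner, treat the output as a starting point and check a few records by hand.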
When you say that the job “times out”, what exactly do you mean? Do you see that PBS is successfully launching the job script and some command or other in the script is timing out, or do you see that the job script is never actually launched? The mom logs from all nodes in the job may be revealing, and/or the job’s stdout/err files depending on where it is getting hung up.
Does it seem to matter if compute-0-4 is the primary or secondary node in the job?