Is there a command to find out which jobs ran on a node?

brownwrap · August 24, 2017, 10:59pm

We have some nodes that might be causing an issue. It seems that when a job that requires two nodes lands on, say, compute-0-4, the job times out. If the job runs entirely on compute-0-4, it completes successfully. I am trying to find another job that ran on compute-0-4 together with another node and completed successfully. Is there a way to find all jobs that have run on compute-0-4? Thanks.

scc · August 25, 2017, 3:02pm

The best way would be to look in your accounting logs. The following is not at all elegant, but something like this might help:

. /etc/pbs.conf ; grep "compute-0-4" $PBS_HOME/server_priv/accounting/201708* | grep ';E;' | grep "Exit_status=0" | awk -F';' '{print $3 " " $4}' | awk '{print $1 " " $12}' | grep '+'

That should give you a list of job IDs to start looking at (from August; adjust the "201708*" as needed, of course). It should print the job ID and the exec_vnode string for successful jobs that ran on more than one vnode, one of which was "compute-0-4". It does not filter out multi-chunk jobs that had all chunks running on compute-0-4, and the list may not be perfect in other ways, so treat it as a starting point.
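If the field positions in the one-liner do not line up on your system, the same filter can be sketched in Python by looking up exec_vnode and Exit_status by name rather than by position. This is only a rough sketch, not a PBS tool: it assumes each accounting record has the form `timestamp;record type;job ID;message`, that the message is a space-separated list of key=value pairs, and that exec_vnode looks like `(compute-0-4:ncpus=1)+(compute-0-5:ncpus=1)`. The function names are made up for illustration. Unlike the one-liner, it also drops jobs whose chunks all ran on the node in question, and it will not match compute-0-40 when asked about compute-0-4.

```python
import re

# Assumed message layout: space-separated key=value pairs.
EXEC_VNODE_RE = re.compile(r"(?:^|\s)exec_vnode=(\S+)")
EXIT_STATUS_RE = re.compile(r"(?:^|\s)Exit_status=(-?\d+)")


def vnodes_in(exec_vnode):
    """Return the set of vnode names found in an exec_vnode string."""
    names = set()
    for chunk in exec_vnode.split("+"):
        # "(compute-0-4:ncpus=1)" -> "compute-0-4"
        name = chunk.strip("()").split(":", 1)[0]
        if name:
            names.add(name)
    return names


def successful_multinode_jobs(lines, node):
    """Yield (job_id, exec_vnode) for 'E' records with Exit_status=0
    that used `node` plus at least one other vnode."""
    for line in lines:
        fields = line.rstrip("\n").split(";", 3)
        if len(fields) < 4 or fields[1] != "E":
            continue  # not a job-end record
        job_id, message = fields[2], fields[3]
        vnode_match = EXEC_VNODE_RE.search(message)
        status_match = EXIT_STATUS_RE.search(message)
        if not vnode_match or not status_match:
            continue
        if int(status_match.group(1)) != 0:
            continue  # job did not exit cleanly
        names = vnodes_in(vnode_match.group(1))
        if node in names and len(names) > 1:
            yield job_id, vnode_match.group(1)


def scan_files(paths, node):
    """Run the filter over several accounting files, in the order given."""
    for path in paths:
        with open(path, errors="replace") as handle:
            yield from successful_multinode_jobs(handle, node)
```

To use it, pass the accounting files for the month you care about, for example `scan_files(sorted(glob.glob(pbs_home + "/server_priv/accounting/201708*")), "compute-0-4")`, where `pbs_home` is the PBS_HOME value from /etc/pbs.conf, and print each pair it yields.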

When you say that the job “times out”, what exactly do you mean? Do you see that PBS is successfully launching the job script and some command or other in the script is timing out, or do you see that the job script is never actually launched? The mom logs from all nodes in the job may be revealing, and/or the job’s stdout/err files depending on where it is getting hung up.

Does it seem to matter if compute-0-4 is the primary or secondary node in the job?
