Interest in a Python qstat?

For many years, NASA’s Advanced Supercomputing (NAS) Division has taken advantage of the “TCL_QSTAT” compilation option to build a version of qstat that uses Tcl scripts to format the output. The scripts NAS developed let system admins and users customize qstat output without building special versions of qstat. For example, on systems with GPUs, the number of GPUs requested by a job is included in the default display, but not on clusters without GPUs. Similarly, users can add or remove fields using command-line options or a $HOME/.qstatrc configuration file. Field widths automatically adjust to accommodate the requested information, or users can specify minimum and maximum widths for individual fields.

Now, however, Altair wants to remove the Tcl option from qstat to simplify maintenance. This will be disruptive for NAS users, so I have been working on a Python version of qstat. If there is enough interest, I’ll propose adding it to the unsupported portion of OpenPBS. Having a user-customizable version of qstat could reduce demand for yet more features in the base qstat.

Question for Altair: What are the plans for the pbs_ifl Python module currently part of PTL? Will it be made more generally available to pbs_python?

(I considered a script that starts from qstat -f -F json output, but rejected it as too inefficient. Also, qstat -f grabs all info about a job from the server, which takes more than twice as long as asking for just the attributes of interest.)
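
For the curious, the direct approach looks roughly like the sketch below. It assumes the swig-built pbs_ifl module mirrors the C API in pbs_ifl.h (pbs_connect, pbs_statjob, pbs_disconnect, and an attrl struct); the exact wrapper details may differ:

import pbs_ifl

def stat_jobs(server, attr_names):
    # Build the linked list of attrl structs naming only the
    # attributes we want back from the server
    head = None
    for name in reversed(attr_names):
        a = pbs_ifl.attrl()
        a.name = name
        a.next = head
        head = a
    conn = pbs_ifl.pbs_connect(server)
    try:
        # An empty job id asks for all jobs, but only the listed attributes
        return pbs_ifl.pbs_statjob(conn, '', head, None)
    finally:
        pbs_ifl.pbs_disconnect(conn)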

The current Python qstat (called “nas_qstat”) recognizes the following fields:

Acct Aoe Cpct Cput Ctime Eff Elapwallt Eligtime Endtime EstStart ExitStatus Group JobID Jobname Lifetime Maxwallt Memory Minwallt Mission Model Nds Place Pmem Pri Qtime Queue ReqID Reqmem Remwallt Reqdwallt Runs S SessID SeqNo Ss Stime TSK User Vmem.

Some of the fields come directly from the pbs_statjob results and others are computed (e.g., Eff = CPU efficiency).
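
For instance, Eff is presumably derived from the resources_used values roughly like this (my guess at the formula; the actual code in nas_field_format.py may differ):

def hhmmss_to_secs(t):
    # '00:15:45' -> 945 seconds
    h, m, s = (int(x) for x in t.split(':'))
    return (h * 60 + m) * 60 + s

def cpu_eff(cput, walltime, ncpus):
    # Percent of the allocated CPU-seconds actually used
    avail = hhmmss_to_secs(walltime) * int(ncpus)
    return '%d%%' % round(100.0 * hhmmss_to_secs(cput) / avail) if avail else '--'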

So, for example, where the normal qstat gives:

Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
8.server2         STDIN            dtalcott                 0 H playq           
21.server2        STDIN            dtalcott                 0 Q playq           
22.server2        STDIN            dtalcott                 0 Q playq           
30.server2        STDIN            dtalcott          00:15:45 R workq           
31.server2        longish_name_5   dtalcott          00:00:10 R workq           

With no options, nas_qstat gives:

                                                 Req'd     Elap
JobID      User     Queue Jobname        TSK Nds wallt S  wallt  Eff
---------- -------- ----- -------------- --- --- ----- - ------ ----
30.server2 dtalcott workq STDIN            1   1 00:33 R  00:16  99%
31.server2 dtalcott workq longish_name_5   1   1 00:33 R  00:00 100%
21.server2 dtalcott playq STDIN            1   1 00:33 Q 213:56   --
22.server2 dtalcott playq STDIN            1   1 00:33 Q 113:46   --
8.server2  dtalcott playq STDIN            1   1 00:03 H  00:00   --

You can add the remaining walltime to the displayed fields with:

nas_qstat -W o=+remwallt
                                                 Req'd     Elap       Rem
JobID      User     Queue Jobname        TSK Nds wallt S  wallt Eff wallt
---------- -------- ----- -------------- --- --- ----- - ------ --- -----
30.server2 dtalcott workq STDIN            1   1 00:33 R  00:17 99% 00:17
31.server2 dtalcott workq longish_name_5   1   1 00:33 R  00:01 99% 00:32
21.server2 dtalcott playq STDIN            1   1 00:33 Q 213:56  -- 00:33
22.server2 dtalcott playq STDIN            1   1 00:33 Q 113:47  -- 00:33
8.server2  dtalcott playq STDIN            1   1 00:03 H  00:00  -- 00:00
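
(Remaining walltime is just the requested walltime minus the elapsed time. A sketch of the idea, reusing the hhmmss_to_secs helper from the earlier sketch:)

def rem_wallt(reqd, elap):
    # Requested minus elapsed, clamped at zero, rendered as HH:MM
    secs = max(0, hhmmss_to_secs(reqd) - hhmmss_to_secs(elap))
    return '%02d:%02d' % (secs // 3600, (secs % 3600) // 60)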

If your users like long job names, you can limit the column width via:

nas_qstat -W fmt_jobname=maxw:10
                                             Req'd     Elap
JobID      User     Queue Jobname    TSK Nds wallt S  wallt Eff
---------- -------- ----- ---------- --- --- ----- - ------ ---
30.server2 dtalcott workq STDIN        1   1 00:33 R  00:17 99%
31.server2 dtalcott workq longish_na   1   1 00:33 R  00:02 99%
21.server2 dtalcott playq STDIN        1   1 00:33 Q 213:57  --
22.server2 dtalcott playq STDIN        1   1 00:33 Q 113:47  --
8.server2  dtalcott playq STDIN        1   1 00:03 H  00:00  --

Sometimes, the beginnings and ends of job names are both important. You can keep both by specifying “end” justification (hj:e) with a truncation character of “*” (rt:*):

nas_qstat -W fmt_jobname='maxw:10 hj:e rt:*'
                                             Req'd     Elap
JobID      User     Queue Jobname    TSK Nds wallt S  wallt Eff
---------- -------- ----- ---------- --- --- ----- - ------ ---
30.server2 dtalcott workq STDIN        1   1 00:33 R  00:22 99%
31.server2 dtalcott workq long*ame_5   1   1 00:33 R  00:06 99%
21.server2 dtalcott playq STDIN        1   1 00:33 Q 214:02  --
22.server2 dtalcott playq STDIN        1   1 00:33 Q 113:52  --
8.server2  dtalcott playq STDIN        1   1 00:03 H  00:00  --
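
The effect is similar to this sketch (how nas_layout actually splits the string may differ):

def squeeze(s, maxw, marker='*'):
    # Keep the start and end of s, marking the elided middle
    if len(s) <= maxw:
        return s
    keep = maxw - len(marker)
    front = keep // 2
    back = keep - front
    return s[:front] + marker + s[len(s) - back:]

squeeze('longish_name_5', 10)   # -> 'long*ame_5'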

The -a option of nas_qstat includes node information in a summarized format (shown here with -r to list only running jobs):

nas_qstat -a -r
server2:     Sat May  1 08:54:47 2021
 Server reports 4 jobs total (T:0 Q:2 H:1 W:0 R:1 E:0 B:0)

           CPUs/
  Host     used/free Tasks Jobs Info
  -------- ----/---- ----- ---- -----------------
  node3       1/   1     1    1
   3 hosts    0/   0     0    0 offline down
  node4       0/   1     0    0 vmac offline down
                                                 Req'd    Elap
JobID      User     Queue Jobname        TSK Nds wallt S wallt Eff
---------- -------- ----- -------------- --- --- ----- - ----- ---
31.server2 dtalcott workq longish_name_5   1   1 00:33 R 00:24 98%

It’s tricky, but admins and users can add new fields (perhaps computed on the fly) without modifying the base code.
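
Purely for illustration, a computed field might look something like this. The function name, signature, and job representation here are hypothetical; the real hooks live in nas_field_format.py:

def fmt_mission(job):
    # Hypothetical formatter for the NAS-specific Mission field,
    # assuming job attributes arrive as a dict of name -> value
    return job.get('Resource_List.mission', '--')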

So, if you are interested, you can fetch the work in progress from GitHub at drtoss/pyqs (Python version of PBS qstat and pbs_rstat). You’ll need to modify the ‘build_pbs_ifl’ script to tell it where to find swig and the OpenPBS source tree. This script builds the pbs_ifl module using pieces that are normally part of PTL. It also creates nas_utils.py as a subset of the BatchUtils class in PTL.
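
Roughly, build_pbs_ifl automates the usual swig steps; the paths below are placeholders for your swig, Python, and OpenPBS locations:

swig -python -I/path/to/openpbs/src/include pbs_ifl.i
cc -shared -fPIC pbs_ifl_wrap.c -I/path/to/python/include \
    -I/path/to/openpbs/src/include -L/path/to/pbs/lib -lpbs -o _pbs_ifl.so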

Pieces:
nas_field_format.py – Functions to compute string values for fields
nas_layout.py – The layout engine that handles field justification, widths, headers, etc.
nas_pbsutil.py – Python versions of C routines used by the C qstat
nas_qstat – Main routine of the Python qstat
nas_rstat – Python version of pbs_rstat (used as the prototype for nas_qstat)
pbs_ifl.i – Slightly modified version of OpenPBS’s swig input file, used to build the pbs_ifl module

The file prof.out is the pstats output from profiling an earlier nas_qstat on a host with 38,000 jobs.
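
To browse it, something like:

python3 -c "import pstats; pstats.Stats('prof.out').sort_stats('cumulative').print_stats(20)"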

I am particularly interested in code speedups. For example, on the host with 38k jobs (active and in history), OpenPBS qstat takes 0.75 seconds, whereas an earlier Python qstat took 2.5 seconds. (The Tcl qstat takes ~20 seconds, so this version is already better than what NAS has been using.)


Hi

I also use pbs_ifl.h and swig to get PBS info via Python. See

This provides me with a very fast interface for getting info on jobs, queues, and nodes.
Much faster than any JSON output.
I will have a closer read of the above.

Mike

The Python program also provides some useful stats, such as CPU utilisation.

$ ./check_utilisation.py
usage: check_utilisation.py  running|finished|all  [-h] [-u USER] [-e EMAIL]
check_utilisation.py: error: the following arguments are required: state
$ ./check_utilisation.py running

Checking utilisation for jobs after 2021-04-28 17:14 PM
Found 52 running jobs out of 74 total jobs.
Job ID  Job Owner         Job Name      Select Statement ncpus  cpu%  cputime  walltime  CPU Util  TIME Util  Comment
                                                                     (hours)   (hours) (percent)  (percent)
142365  u11757628  anc_2index_part    1:mem=16gb:ncpus=1     1    98     93.0      94.4      8.0%       8.6%  Good
143292  u11757628  anc_2index_part    1:mem=16gb:ncpus=1     1    97     73.1      74.2     97.0%      98.6%  Good
146612    u120274            STDIN   1:ncpus=48:mem=500g    48   646      0.9       0.4     13.5%       4.2%  CHECK !
140999    u127399           run.sh     1:mem=2gb:ncpus=2     2   173    252.3     145.1     86.5%      86.9%  Good
146561    u128038  nf-LOOKUP_TABLE  1:ncpus=6:mem=204800     6   599     33.3       5.7     99.8%      97.8%  Good

This uses our in-house developed library to get PBS stats via pbs_ifl.h, as in my previous post.
I definitely would not like to see our ability to query PBS via a fast API removed.
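
(For reference, the CPU Util column above is along the lines of cputime divided by ncpus times walltime. A rough sketch of the idea, not the exact in-house code:)

def cpu_util(cputime_hrs, walltime_hrs, ncpus):
    # Percent of the allocated CPU-hours actually consumed
    if walltime_hrs <= 0:
        return None
    return 100.0 * cputime_hrs / (ncpus * walltime_hrs)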

Mike

Thanks - we will give it a look now.

Does the version of swig matter? We are still running RHEL/CentOS 7, and 3.0.12 seems to be the latest swig available. Happy to install from source, but would prefer to use the distro-provided package.

Ah! The version of PBS matters. There’s no libauth.h in 19.1.3.

I used to run it on CentOS 6 until earlier this year. You just need to use the swig that comes with your distro.


Turns out you don’t need libauth.h on PBS 19. Just ignore the error message from grep.

Also, the #! line at the beginning of nas_qstat has the NAS path to pbs_python hard-coded. Change it to match wherever you have pbs_python installed. I thought about using the

#!/usr/bin/env pbs_python

technique, but that adds extra overhead I’m trying to reduce.


Thanks @dtalcott and @speleolinux - appreciated.

FWIW, I added a first pass at a man page: nas_qstat.1. It describes all the options, but it still needs EXAMPLES and an explanation of how to do local customization.