Jobs were not dispatched even though there were sufficient free nodes and resources in the appropriate node_pool

Hi,

Recently, we hit a weird issue with PBS where jobs were not dispatched even though there were sufficient free nodes and resources in the appropriate node_pool.

  1. There are 52 nodes in node_pool ‘research’
  2. I verified that a total of 20 nodes were completely free in node_pool ‘research’.
  3. Each node has 128 cores and roughly 500GB of memory.
  4. There are no max job count or max ncpus restrictions on the queue ‘rnd’.
  5. Each job requests just 1 chunk with partial resources of a single node (see the reconstructed submission after this list).
  6. The ‘place’ spec is ‘free’.
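
For reference, the queued jobs correspond to a submission along these lines, reconstructed from the qstat -f output further below (the script name is illustrative):

qsub -q rnd -l select=1:ncpus=97:mem=388gb -l place=free -l walltime=01:00:00 job.sh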

The PBS scheduler just held the free nodes and didn’t release them to jobs in the queue ‘rnd’.
Checking with ‘qstat -f’ showed the comment ‘Not Running: Insufficient amount of resource: node_pool’.

I have no clue what went wrong and am wondering how to figure out the root cause.

Here is the relevant job, node, and queue info.

PBS Version : 2021.1.3

Free nodes in node_pool ‘research’

[jliu@x3003c0s11b0n0:scripts]$ ./pbsnodes_pool_status.sh  | grep ' 502gb/502gb 128/128'
x3001c0s9b1n0   free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3001c0s9b2n0   free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3001c0s9b4n0   free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3001c0s11b1n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3001c0s11b2n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3001c0s13b3n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s3b3n0   free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s5b2n0   free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s7b2n0   free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s9b2n0   free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s9b3n0   free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s9b4n0   free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s11b1n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s11b2n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s13b1n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s13b2n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s13b3n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s13b4n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s15b1n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --
x3002c0s15b2n0  free                 0     0      0  502gb/502gb 128/128     0/0     0/0 --

Jobs in ‘Q’ state for queue ‘rnd’

[jliu@x3003c0s11b0n0:scripts]$ qstat -u prodops | grep ' Q '
2190951.x3003c* prodops  rnd      stageJob   152448   1  97  388gb 01:00 Q 00:00
2190953.x3003c* prodops  rnd      stageJob   152779   1  97  388gb 01:00 Q 00:00
2190954.x3003c* prodops  rnd      stageJob   152829   1  97  388gb 01:00 Q 00:00
2190955.x3003c* prodops  rnd      stageJob   153160   1  97  388gb 01:00 Q 00:00
2190956.x3003c* prodops  rnd      stageJob   153488   1  97  388gb 01:00 Q 00:00
2190957.x3003c* prodops  rnd      stageJob   153548   1  97  388gb 01:00 Q 00:00
2190958.x3003c* prodops  rnd      stageJob   153876   1  97  388gb 01:00 Q 00:00
2190959.x3003c* prodops  rnd      stageJob   154206   1  97  388gb 01:00 Q 00:00
2190960.x3003c* prodops  rnd      stageJob   154257   1  97  388gb 01:00 Q 00:00
2190961.x3003c* prodops  rnd      stageJob   154581   1  97  388gb 01:00 Q 00:00
2190962.x3003c* prodops  rnd      stageJob   154910   1  97  388gb 01:00 Q 00:00
2190963.x3003c* prodops  rnd      stageJob   154957   1  97  388gb 01:00 Q 00:00
2190964.x3003c* prodops  rnd      stageJob   155280   1  97  388gb 01:00 Q 00:00
2190965.x3003c* prodops  rnd      stageJob   155605   1  97  388gb 01:00 Q 00:00
2190966.x3003c* prodops  rnd      stageJob   155875   1  97  388gb 01:00 Q 00:00
2190967.x3003c* prodops  rnd      stageJob   155981   1  97  388gb 01:00 Q 00:00
2190968.x3003c* prodops  rnd      stageJob   156309   1  97  388gb 01:00 Q 00:00
2190969.x3003c* prodops  rnd      stageJob   156637   1  97  388gb 01:00 Q 00:00
2190970.x3003c* prodops  rnd      stageJob   156688   1  97  388gb 01:00 Q 00:00
2190971.x3003c* prodops  rnd      stageJob   157016   1  97  388gb 01:00 Q 00:00
2190972.x3003c* prodops  rnd      stageJob   157347   1  97  388gb 01:00 Q 00:00
2190973.x3003c* prodops  rnd      stageJob   157396   1  97  388gb 01:00 Q 00:00
2190975.x3003c* prodops  rnd      stageJob   157737   1  97  388gb 01:00 Q 00:00
2190976.x3003c* prodops  rnd      stageJob   158066   1  97  388gb 01:00 Q 00:00
2190977.x3003c* prodops  rnd      stageJob   158115   1  97  388gb 01:00 Q 00:00
2190978.x3003c* prodops  rnd      stageJob   158443   1  97  388gb 01:00 Q 00:00
2190979.x3003c* prodops  rnd      stageJob   158793   1  97  388gb 01:00 Q 00:00
2190980.x3003c* prodops  rnd      stageJob   159129   1  97  388gb 01:00 Q 00:00
2190981.x3003c* prodops  rnd      stageJob   159181   1  97  388gb 01:00 Q 00:00
2190982.x3003c* prodops  rnd      stageJob   159513   1  97  388gb 01:00 Q 00:00
2190983.x3003c* prodops  rnd      stageJob   159840   1  97  388gb 01:00 Q 00:00
2190984.x3003c* prodops  rnd      stageJob   159891   1  97  388gb 01:00 Q 00:00
2190985.x3003c* prodops  rnd      stageJob   160218   1  97  388gb 01:00 Q 00:00
2190986.x3003c* prodops  rnd      stageJob   160557   1  97  388gb 01:00 Q 00:00
2190987.x3003c* prodops  rnd      stageJob   160679   1  97  388gb 01:00 Q 00:00

Job details

[jliu@x3003c0s11b0n0:scripts]$ qstat -f 2190987
Job Id: 2190987.x3003c0s17b0n0
    Job_Name = stageJob
    Job_Owner = prodops@x3003c0s13b0n0.head.cm.weather.gov.sg
    resources_used.cpupercent = 0
    resources_used.cput = 00:00:00
    resources_used.mem = 33692kb
    resources_used.ncpus = 97
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = Q
    queue = rnd
    server = x3003c0s17b0n0
    Checkpoint = u
    ctime = Thu Sep  5 07:26:02 2024
    Error_Path = x3003c0s13b0n0.head.cm.weather.gov.sg:/home/prodops/cylc-run/m
        ss-aa999/log/job/20240905T0000Z/run_eceps_batch_7/01/job.err
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Sep  5 10:23:33 2024
    Output_Path = x3003c0s13b0n0.head.cm.weather.gov.sg:/home/prodops/cylc-run/
        mss-aa999/log/job/20240905T0000Z/run_eceps_batch_7/01/job.out
    Priority = 0
    qtime = Thu Sep  5 07:26:02 2024
    Rerunable = True
    Resource_List.am_job_cache = {"job.walltime": 3600, "job.ncpus": 97.0}
    Resource_List.mem = 388gb
    Resource_List.ncpus = 97
    Resource_List.ngpus = 0
    Resource_List.nodect = 1
    Resource_List.place = free
    Resource_List.queue_charge_rate = 1
    Resource_List.select = 1:ncpus=97:mem=388GB
    Resource_List.walltime = 01:00:00
    session_id = 160679
    substate = 10
    Variable_List = PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=prodops,
        PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc
        /7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/
        mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/ap
        p/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/
        scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1
        .9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib
        2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/
        project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/ap
        p/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/py
        gobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/p
        e/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2
        /binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binuti
        ls/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/o
        pt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/
        pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sb
        in:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/loc
        al/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/b
        in,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,
        PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,
        PBS_O_HOST=x3003c0s13b0n0.head.cm.weather.gov.sg
    comment = Not Running: Insufficient amount of resource: node_pool
    etime = Thu Sep  5 07:26:02 2024
    umask = 22
    run_count = 3
    Exit_status = -11
    Submit_arguments = /home/prodops/cylc-run/mss-aa999/log/job/20240905T0000Z/
        run_eceps_batch_7/01/job
    project = prod_ops
    Submit_Host = x3003c0s13b0n0.head.cm.weather.gov.sg

Queue ‘rnd’ configs

[jliu@x3003c0s11b0n0:scripts]$ qmgr -c "p q rnd"
#
# Create queues and set their attributes.
#
#
# Create and define queue rnd
#
create queue rnd
set queue rnd queue_type = Execution
set queue rnd Priority = 130
set queue rnd acl_user_enable = False
set queue rnd resources_max.ngpus = 0
set queue rnd resources_max.walltime = 24:00:00
set queue rnd default_chunk.node_pool = research
set queue rnd enabled = True
set queue rnd started = True

Appreciate your inputs.

Regards

Jerry

Could you please share the output of the below commands:

  1. qstat -Bf
  2. pbsnodes -av
  3. tracejob 2190987

For some reason the job was rerun 3 times, as you can see from Exit_status = -11 and run_count = 3.
Please check the MoM logs of the compute nodes on which this job was placed.
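
For example, on each execution host (assuming the default PBS_HOME of /var/spool/pbs and a daily log file named by date):

grep 2190987 /var/spool/pbs/mom_logs/20240905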

Also, if you grep for node_pool in the qstat -f 2190987 output, we do not see any request for ‘research’.
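
A quick way to confirm (only the scheduler comment mentions node_pool; Resource_List does not):

qstat -f 2190987 | grep node_pool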

Hi Adarsh,

Thanks for your inputs.

  1. tracejob showed job 2190987 was preempted immediately after it started to run.
  2. The MoM log showed “suspend failed”.
  3. The immediately following job 2190988 hit the same problem.

I verified that no preemption queue was configured, so I’m confused about where the preempt signal came from.

PS: sorry, I’m not able to share pbsnodes -av, as it’s too big and I’m not able to upload an attachment.

Here is the relevant info.

qstat -Bf

Server: x3003c0s17b0n0
    server_state = Active
    server_host = x3003c0s17b0n0
    scheduling = True
    total_jobs = 99267
    state_count = Transit:0 Queued:48 Held:0 Waiting:0 Running:35 Exiting:0 Beg
	un:0 
    acl_resv_user_enable = True
    acl_resv_users = jliu,root,weng,zhangdx
    acl_roots = root@*
    default_queue = workq
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_available.am_finished_job = 2171878.x3003c0s17b0n0 20240831
    resources_default.mem = 10mb
    resources_default.ncpus = 1
    resources_default.queue_charge_rate = 1
    resources_default.walltime = 00:05:00
    default_chunk.ncpus = 1
    resources_assigned.mem = 34792gb
    resources_assigned.ncpus = 8902
    resources_assigned.nodect = 71
    scheduler_iteration = 30
    flatuid = True
    resv_enable = True
    node_fail_requeue = 700
    max_array_size = 10000
    pbs_license_info = 6200@admin
    pbs_license_min = 0
    pbs_license_max = 2147483647
    pbs_license_linger_time = 31536000
    license_count = Avail_Global:6 Avail_Local:40 Used:158 High_Use:158
    pbs_version = 2021.1.3.20220217134230
    job_sort_formula = ((log10(ncpus)/10)+(1-(log10(walltime)/10)))/2
    eligible_time_enable = False
    job_history_enable = True
    max_concurrent_provision = 5
    max_job_sequence_id = 9999999

tracejob 2190987

Job: 2190987.x3003c0s17b0n0

09/05/2024 07:26:02  S    enqueuing into rnd, state 1 hop 1
09/05/2024 07:26:02  S    Job Queued at request of prodops@x3003c0s13b0n0, owner = prodops@x3003c0s13b0n0, job name = stageJob, queue = rnd
09/05/2024 07:26:02  S    Job Modified at request of Scheduler@x3003c0s17b0n0
09/05/2024 07:26:02  A    user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 
09/05/2024 10:14:47  L    Received qrun request
09/05/2024 10:14:50  S    'am_hook' hook set job's Resource_List.am_job_cache = {"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:14:50  S    Job Run at request of Scheduler@x3003c0s17b0n0 on exec_vnode (x3001c0s7b3n0:ncpus=97:mem=406847488kb)
09/05/2024 10:14:50  S    signal job request received
09/05/2024 10:14:50  S    suspend job by Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50  S    job signaled with suspend by Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50  S    Execution server rejected request
09/05/2024 10:14:50  S    preemption method suspend failed for job (15016)
09/05/2024 10:14:50  S    Holds s set at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50  S    preemption method checkpoint failed for job (15029)
09/05/2024 10:14:50  S    Job Rerun at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50  L    Job run
09/05/2024 10:14:50  A    Resource_List.am_job_cache={"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:14:50  A    user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 start=1725531290 exec_host=x3001c0s7b3n0/0*97 exec_vnode=(x3001c0s7b3n0:ncpus=97:mem=406847488kb) Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 resource_assigned.mem=406847488kb resource_assigned.ncpus=97 
09/05/2024 10:14:51  S    Obit received momhop:1 serverhop:1 state:4 substate:60
09/05/2024 10:14:53  S    Job Modified at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:14:53  L    Job preempted by requeuing
09/05/2024 10:14:53  A    user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 start=1725531290 exec_host=x3001c0s7b3n0/0*97 exec_vnode=(x3001c0s7b3n0:ncpus=97:mem=406847488kb) Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 session=126109 end=1725531293 Exit_status=-11 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=2796kb resources_used.ncpus=97 resources_used.vmem=12316kb resources_used.walltime=00:00:00 run_count=1
09/05/2024 10:18:43  L    Received qrun request
09/05/2024 10:18:47  S    'am_hook' hook set job's Resource_List.am_job_cache = {"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:18:47  S    Job Run at request of Scheduler@x3003c0s17b0n0 on exec_vnode (x3001c0s7b3n0:ncpus=97:mem=406847488kb)
09/05/2024 10:18:47  S    Job Modified at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  S    Execution server rejected request
09/05/2024 10:18:47  S    Holds s set at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  S    Job Rerun at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  L    Job run
09/05/2024 10:18:47  A    Resource_List.am_job_cache={"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:18:47  S    signal job request received
09/05/2024 10:18:47  S    suspend job by Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  S    job signaled with suspend by Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  S    preemption method suspend failed for job (15016)
09/05/2024 10:18:47  S    preemption method checkpoint failed for job (15029)
09/05/2024 10:18:47  A    user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 start=1725531527 exec_host=x3001c0s7b3n0/0*97 exec_vnode=(x3001c0s7b3n0:ncpus=97:mem=406847488kb) Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 resource_assigned.mem=406847488kb resource_assigned.ncpus=97 
09/05/2024 10:18:48  S    Obit received momhop:2 serverhop:2 state:4 substate:60
09/05/2024 10:18:50  L    Job preempted by requeuing

mom log

09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['cpu/0/cpuid', 'rwm', '*']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidiactl', 'rwm']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-modeset', 'rwm']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm-tools', 'rwm']
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Assigned resources: {'cpuset.cpus': [0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6, 134, 7, 135, 8, 136, 9, 137, 10, 138, 11, 139, 12, 140, 13, 141, 14, 142, 15, 143, 16, 144, 17, 145, 18, 146, 19, 147, 20, 148, 21, 149, 22, 150, 23, 151, 24, 152, 25, 153, 26, 154, 27, 155, 28, 156, 29, 157, 30, 158, 31, 159, 32, 160, 33, 161, 34, 162, 35, 163, 36, 164, 37, 165, 38, 166, 39, 167, 40, 168, 41, 169, 42, 170, 43, 171, 44, 172, 45, 173, 46, 174, 47, 175, 48, 176, 49, 177, 50, 178, 51, 179, 52, 180, 53, 181, 54, 182, 55, 183, 56, 184, 57, 185, 58, 186, 59, 187, 60, 188, 61, 189, 62, 190, 63, 191, 64, 192, 65, 193, 66, 194, 67, 195, 68, 196, 69, 197, 70, 198, 71, 199, 72, 200, 73, 201, 74, 202, 75, 203, 76, 204, 77, 205, 78, 206, 79, 207, 80, 208, 81, 209, 82, 210, 83, 211, 84, 212, 85, 213, 86, 214, 87, 215, 88, 216, 89, 217, 90, 218, 91, 219, 92, 220, 93, 221, 94, 222, 95, 223, 96, 224], 'cpuset.mems': [4, 2, 0, 7, 5, 3, 1, 6], 'mem': 416611827712}
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Key: devices not found in assigned
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU usage: 0.000 secs
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: Memory usage: mem=0b
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 64 (elapsed time: 0.2087)
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 3 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 5 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:50;0004;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job with suspend
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;suspend failed, job substate = 41
09/05/2024 10:14:50;0080;pbs_mom;Req;req_reject;Reject reply code=15016, aux=0, type=18, from root@172.23.0.4:15001
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:50;0004;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job with Rerun
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;x3001c0s7b3n0 cput=00:00:00 mem=0kb
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_launch, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_epilogue, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 2048 (elapsed time: 0.2593)
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_cgroups;env orig: TZ=UTC,PATH=/bin:/usr/bin,PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=prodops,PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc/7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/app/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1.9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/app/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/pygobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binutils/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/opt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,PBS_O_HOST=x3003c0s13b0n0input,HOME=/home/prodops,LOGNAME=prodops,PBS_JOBNAME=stageJob,PBS_JOBID=2190987.x3003c0s17b0n0,PBS_QUEUE=rnd,SHELL=/bin/bash,USER=prodops,PBS_JOBCOOKIE=1F3137392E31EED366E50ECE07D09AED,PBS_NODENUM=0,PBS_TASKNUM=1,PBS_MOMPORT=15003,OMP_NUM_THREADS=97,NCPUS=97,PBS_NODEFILE=/var/spool/pbs/aux/2190987.x3003c0s17b0n0,TMPDIR=/var/tmp/pbs.2190987.x3003c0s17b0n0,PBS_JOBDIR=/home/prodops,PBS_ENVIRONMENT=PBS_BATCH,ENVIRONMENT=BATCH
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_cgroups;env new: TZ=UTC,PATH=/bin:/usr/bin,PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=prodops,PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc/7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/app/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1.9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/app/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/pygobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binutils/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/opt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,PBS_O_HOST=x3003c0s13b0n0input,HOME=/home/prodops,LOGNAME=prodops,PBS_JOBNAME=stageJob,PBS_JOBID=2190987.x3003c0s17b0n0,PBS_QUEUE=rnd,SHELL=/bin/bash,USER=prodops,PBS_JOBCOOKIE=1F3137392E31EED366E50ECE07D09AED,PBS_NODENUM=0,PBS_TASKNUM=1,PBS_MOMPORT=15003,OMP_NUM_THREADS=97,NCPUS=97,PBS_NODEFILE=/var/spool/pbs/aux/2190987.x3003c0s17b0n0,TMPDIR=/var/tmp/pbs.2190987.x3003c0s17b0n0,PBS_JOBDIR=/home/prodops,PBS_ENVIRONMENT=PBS_BATCH,ENVIRONMENT=BATCH,CUDA_VISIBLE_DEVICES=
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;Started, pid = 126109
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU usage: 0.007 secs
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: Memory usage: mem=2796kb
09/05/2024 10:14:50;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;task 00000001 terminated
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;Terminated
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;x3001c0s7b3n0 cput=00:00:00 mem=0kb
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_epilogue, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 256 (elapsed time: 0.6490)
09/05/2024 10:14:51;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:51;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:51;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: No CPU usage data
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 256 (elapsed time: 0.3052)
09/05/2024 10:14:51;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:51;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;Obit sent
09/05/2024 10:14:51;0100;pbs_mom;Req;;Type 14 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:51;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:52;0100;pbs_mom;Req;;Type 6 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:52;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;delete job request received
09/05/2024 10:14:52;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_end, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:52;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 512 (elapsed time: 0.0112)
09/05/2024 10:14:52;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:53;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:53;0100;pbs_mom;Req;;Type 1 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:53;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['cpu/0/cpuid', 'rwm', '*']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidiactl', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-modeset', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm-tools', 'rwm']
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Assigned resources: {'cpuset.cpus': [0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6, 134, 7, 135, 8, 136, 9, 137, 10, 138, 11, 139, 12, 140, 13, 141, 14, 142, 15, 143, 16, 144, 17, 145, 18, 146, 19, 147, 20, 148, 21, 149, 22, 150, 23, 151, 24, 152, 25, 153, 26, 154, 27, 155, 28, 156, 29, 157, 30, 158, 31, 159, 32, 160, 33, 161, 34, 162, 35, 163, 36, 164, 37, 165, 38, 166, 39, 167, 40, 168, 41, 169, 42, 170, 43, 171, 44, 172, 45, 173, 46, 174, 47, 175, 48, 176, 49, 177, 50, 178, 51, 179, 52, 180, 53, 181, 54, 182, 55, 183, 56, 184, 57, 185, 58, 186, 59, 187, 60, 188, 61, 189, 62, 190, 63, 191, 64, 192, 65, 193, 66, 194, 67, 195, 68, 196, 69, 197, 70, 198, 71, 199, 72, 200, 73, 201, 74, 202, 75, 203, 76, 204, 77, 205, 78, 206, 79, 207, 80, 208, 81, 209, 82, 210, 83, 211, 84, 212, 85, 213, 86, 214, 87, 215, 88, 216, 89, 217, 90, 218, 91, 219, 92, 220, 93, 221, 94, 222, 95, 223, 96, 224], 'cpuset.mems': [4, 2, 0, 7, 5, 3, 1, 6], 'mem': 416611827712}
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Key: devices not found in assigned
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU usage: 0.000 secs
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: Memory usage: mem=0b
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190988.x3003c0s17b0n0, event_type 64 (elapsed time: 0.2182)
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 3 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 5 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with suspend
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;suspend failed, job substate = 41
09/05/2024 10:14:54;0080;pbs_mom;Req;req_reject;Reject reply code=15016, aux=0, type=18, from root@172.23.0.4:15001
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with Rerun
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job
09/05/2024 10:14:54;0100;pbs_mom;Job;2190988.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job
09/05/2024 10:14:54;0100;pbs_mom;Job;2190988.x3003c0s17b0n0;x3001c0s7b3n0 cput=00:00:00 mem=0kb
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_launch, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_epilogue, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190988.x3003c0s17b0n0, event_type 2048 (elapsed time: 0.2699)
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_cgroups;env orig: TZ=UTC,PATH=/bin:/usr/bin,PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=prodops,PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc/7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/app/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1.9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/app/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/pygobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binutils/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/opt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,PBS_O_HOST=x3003c0s13b0n0input,HOME=/home/prodops,LOGNAME=prodops,PBS_JOBNAME=stageJob,PBS_JOBID=2190988.x3003c0s17b0n0,PBS_QUEUE=rnd,SHELL=/bin/bash,USER=prodops,PBS_JOBCOOKIE=70D87BA45B2BD3382D97B2524A102E4F,PBS_NODENUM=0,PBS_TASKNUM=1,PBS_MOMPORT=15003,OMP_NUM_THREADS=97,NCPUS=97,PBS_NODEFILE=/var/spool/pbs/aux/2190988.x3003c0s17b0n0,TMPDIR=/var/tmp/pbs.2190988.x3003c0s17b0n0,PBS_JOBDIR=/home/prodops,PBS_ENVIRONMENT=PBS_BATCH,ENVIRONMENT=BATCH

... ...


Thank you, Jerry, for sharing this information; appreciated, and I understand the limitation.

Please check the below (a command sketch of these checks follows the list):

  • check the configuration in qmgr -c “print sched”
    – check for preempt_* configurations (mainly preempt_order)
    – if you have implemented ‘s’ / ‘r’, check whether you have the scripts created and referenced in mom_priv/config on the compute nodes
  • check the server logs and scheduler logs to find out what happened with respect to the two jobs 2190987 and 2190988 (to see whether the suspend was sent manually or someone forced the job with qrun)
  • check $PBS_HOME/sched_priv/sched_config # whether preemptive_sched is turned on and what the preempt_* settings are
  • please check whether there are any uncleaned cgroups from jobs (jobs that might have run previously and not cleaned up their cgroup folders)
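
A rough shell sketch of those checks (the sched_config path assumes the default PBS_HOME of /var/spool/pbs; the cgroup path is taken from the MoM log above):

qmgr -c "print sched" | grep preempt
grep -E "^(preemptive_sched|preempt_)" /var/spool/pbs/sched_priv/sched_config
# on each compute node: list any leftover cgroup job directories
ls /sys/fs/cgroup/memory/pbs_jobs.service/jobid/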

Adarsh,

Thank you very much for the further inputs.

The preempt message is a little misleading. I believe it’s due to the qrun.
I su’d to root and forced the job to run with qrun, to see whether the job could be dispatched to the ‘free’ nodes.

Managed to find the reason: the MoM daemons on those nodes were in an ‘idle’ state and didn’t work as expected. The MoM log was empty (zero size), even though systemctl status showed the service as active and running, and pbsnodes also showed everything as OK.

The fix should be very simple: just reboot the node or restart the MoM service.
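
For example, on the affected compute node (assuming the MoM runs under the standard ‘pbs’ systemd unit):

systemctl restart pbs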

I’m wondering how to dig out the root cause, or what the possible reasons are. I didn’t get any error or warning messages about those ‘idling’ nodes from the server log or scheduler log. It looked like the server still thought all the nodes were ‘good’.

If it happens next time, or if you can replicate the issue (a command sketch follows the list):

  • increase the daemon log level of the server, scheduler, and MoM (trace the problematic job through these verbose logs)
  • get a gstack of the MoM daemon
  • check whether the cgroup hierarchy has any zombie job directories
  • if you check the PBS Professional 2022.1 Reference Guide, RG-365, Table 8-6: Vnode States # the state of a node should be one of those mentioned in the table; it cannot be ‘idle’. You might have meant the scheduler was in an idle state.
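
A sketch of the first two steps (the exact log_events bitmask and service layout may differ on your installation):

# server: raise the logging bitmask (your current value is 511)
qmgr -c "set server log_events = 2047"
# MoM: set "$logevent 0xffffffff" in mom_priv/config, then restart the MoM
# grab a stack trace of the running MoM daemon
gstack $(pgrep -x pbs_mom)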