Jobs were not dispatched even though there were sufficient nodes and sufficient resources for appropriate node_pool

Hi adarsh,

Thanks for your inputs.

  1. tracejob showed job 2190987 was preempted immediatly after it started to run
  2. mom log showed “suspend failed”.
  3. The immediatly followed job 2190988 also got the same problem.

I could make sure no preempt queue was configured. Confused where the preempt signal came from,

PS, sorry not able to share pbsndoes -av, as it’s too big and neither am I able to upload attachment.

Here are appropriate info.

qstat -Bf

Server: x3003c0s17b0n0
    server_state = Active
    server_host = x3003c0s17b0n0
    scheduling = True
    total_jobs = 99267
    state_count = Transit:0 Queued:48 Held:0 Waiting:0 Running:35 Exiting:0 Beg
	un:0 
    acl_resv_user_enable = True
    acl_resv_users = jliu,root,weng,zhangdx
    acl_roots = root@*
    default_queue = workq
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_available.am_finished_job = 2171878.x3003c0s17b0n0 20240831
    resources_default.mem = 10mb
    resources_default.ncpus = 1
    resources_default.queue_charge_rate = 1
    resources_default.walltime = 00:05:00
    default_chunk.ncpus = 1
    resources_assigned.mem = 34792gb
    resources_assigned.ncpus = 8902
    resources_assigned.nodect = 71
    scheduler_iteration = 30
    flatuid = True
    resv_enable = True
    node_fail_requeue = 700
    max_array_size = 10000
    pbs_license_info = 6200@admin
    pbs_license_min = 0
    pbs_license_max = 2147483647
    pbs_license_linger_time = 31536000
    license_count = Avail_Global:6 Avail_Local:40 Used:158 High_Use:158
    pbs_version = 2021.1.3.20220217134230
    job_sort_formula = ((log10(ncpus)/10)+(1-(log10(walltime)/10)))/2
    eligible_time_enable = False
    job_history_enable = True
    max_concurrent_provision = 5
    max_job_sequence_id = 9999999

jobtrace 2190987

Job: 2190987.x3003c0s17b0n0

09/05/2024 07:26:02  S    enqueuing into rnd, state 1 hop 1
09/05/2024 07:26:02  S    Job Queued at request of prodops@x3003c0s13b0n0, owner = prodops@x3003c0s13b0n0, job name = stageJob, queue = rnd
09/05/2024 07:26:02  S    Job Modified at request of Scheduler@x3003c0s17b0n0
09/05/2024 07:26:02  A    user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 
09/05/2024 10:14:47  L    Received qrun request
09/05/2024 10:14:50  S    'am_hook' hook set job's Resource_List.am_job_cache = {"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:14:50  S    Job Run at request of Scheduler@x3003c0s17b0n0 on exec_vnode (x3001c0s7b3n0:ncpus=97:mem=406847488kb)
09/05/2024 10:14:50  S    signal job request received
09/05/2024 10:14:50  S    suspend job by Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50  S    job signaled with suspend by Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50  S    Execution server rejected request
09/05/2024 10:14:50  S    preemption method suspend failed for job (15016)
09/05/2024 10:14:50  S    Holds s set at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50  S    preemption method checkpoint failed for job (15029)
09/05/2024 10:14:50  S    Job Rerun at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50  L    Job run
09/05/2024 10:14:50  A    Resource_List.am_job_cache={"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:14:50  A    user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 start=1725531290 exec_host=x3001c0s7b3n0/0*97 exec_vnode=(x3001c0s7b3n0:ncpus=97:mem=406847488kb) Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 resource_assigned.mem=406847488kb resource_assigned.ncpus=97 
09/05/2024 10:14:51  S    Obit received momhop:1 serverhop:1 state:4 substate:60
09/05/2024 10:14:53  S    Job Modified at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:14:53  L    Job preempted by requeuing
09/05/2024 10:14:53  A    user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 start=1725531290 exec_host=x3001c0s7b3n0/0*97 exec_vnode=(x3001c0s7b3n0:ncpus=97:mem=406847488kb) Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 session=126109 end=1725531293 Exit_status=-11 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=2796kb resources_used.ncpus=97 resources_used.vmem=12316kb resources_used.walltime=00:00:00 run_count=1
09/05/2024 10:18:43  L    Received qrun request
09/05/2024 10:18:47  S    'am_hook' hook set job's Resource_List.am_job_cache = {"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:18:47  S    Job Run at request of Scheduler@x3003c0s17b0n0 on exec_vnode (x3001c0s7b3n0:ncpus=97:mem=406847488kb)
09/05/2024 10:18:47  S    Job Modified at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  S    Execution server rejected request
09/05/2024 10:18:47  S    Holds s set at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  S    Job Rerun at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  L    Job run
09/05/2024 10:18:47  A    Resource_List.am_job_cache={"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:18:47  S    signal job request received
09/05/2024 10:18:47  S    suspend job by Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  S    job signaled with suspend by Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47  S    preemption method suspend failed for job (15016)
09/05/2024 10:18:47  S    preemption method checkpoint failed for job (15029)
09/05/2024 10:18:47  A    user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 start=1725531527 exec_host=x3001c0s7b3n0/0*97 exec_vnode=(x3001c0s7b3n0:ncpus=97:mem=406847488kb) Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 resource_assigned.mem=406847488kb resource_assigned.ncpus=97 
09/05/2024 10:18:48  S    Obit received momhop:2 serverhop:2 state:4 substate:60
09/05/2024 10:18:50  L    Job preempted by requeuing

mom log

09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['cpu/0/cpuid', 'rwm', '*']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidiactl', 'rwm']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-modeset', 'rwm']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm-tools', 'rwm']
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Assigned resources: {'cpuset.cpus': [0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6, 134, 7, 135, 8, 136, 9, 137, 10, 138, 11, 139, 12, 140, 13, 141, 14, 142, 15, 143, 16, 144, 17, 145, 18, 146, 19, 147, 20, 148, 21, 149, 22, 150, 23, 151, 24, 152, 25, 153, 26, 154, 27, 155, 28, 156, 29, 157, 30, 158, 31, 159, 32, 160, 33, 161, 34, 162, 35, 163, 36, 164, 37, 165, 38, 166, 39, 167, 40, 168, 41, 169, 42, 170, 43, 171, 44, 172, 45, 173, 46, 174, 47, 175, 48, 176, 49, 177, 50, 178, 51, 179, 52, 180, 53, 181, 54, 182, 55, 183, 56, 184, 57, 185, 58, 186, 59, 187, 60, 188, 61, 189, 62, 190, 63, 191, 64, 192, 65, 193, 66, 194, 67, 195, 68, 196, 69, 197, 70, 198, 71, 199, 72, 200, 73, 201, 74, 202, 75, 203, 76, 204, 77, 205, 78, 206, 79, 207, 80, 208, 81, 209, 82, 210, 83, 211, 84, 212, 85, 213, 86, 214, 87, 215, 88, 216, 89, 217, 90, 218, 91, 219, 92, 220, 93, 221, 94, 222, 95, 223, 96, 224], 'cpuset.mems': [4, 2, 0, 7, 5, 3, 1, 6], 'mem': 416611827712}
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Key: devices not found in assigned
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU usage: 0.000 secs
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: Memory usage: mem=0b
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 64 (elapsed time: 0.2087)
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 3 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 5 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:50;0004;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job with suspend
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;suspend failed, job substate = 41
09/05/2024 10:14:50;0080;pbs_mom;Req;req_reject;Reject reply code=15016, aux=0, type=18, from root@172.23.0.4:15001
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:50;0004;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job with Rerun
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;x3001c0s7b3n0 cput=00:00:00 mem=0kb
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_launch, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_epilogue, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 2048 (elapsed time: 0.2593)
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_cgroups;env orig: TZ=UTC,PATH=/bin:/usr/bin,PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=prodops,PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc/7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/app/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1.9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/app/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/pygobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binutils/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/opt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,PBS_O_HOST=x3003c0s13b0n0input,HOME=/home/prodops,LOGNAME=prodops,PBS_JOBNAME=stageJob,PBS_JOBID=2190987.x3003c0s17b0n0,PBS_QUEUE=rnd,SHELL=/bin/bash,USER=prodops,PBS_JOBCOOKIE=1F3137392E31EED366E50ECE07D09AED,PBS_NODENUM=0,PBS_TASKNUM=1,PBS_MOMPORT=15003,OMP_NUM_THREADS=97,NCPUS=97,PBS_NODEFILE=/var/spool/pbs/aux/2190987.x3003c0s17b0n0,TMPDIR=/var/tmp/pbs.2190987.x3003c0s17b0n0,PBS_JOBDIR=/home/prodops,PBS_ENVIRONMENT=PBS_BATCH,ENVIRONMENT=BATCH
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_cgroups;env new: TZ=UTC,PATH=/bin:/usr/bin,PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=prodops,PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc/7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/app/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1.9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/app/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/pygobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binutils/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/opt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,PBS_O_HOST=x3003c0s13b0n0input,HOME=/home/prodops,LOGNAME=prodops,PBS_JOBNAME=stageJob,PBS_JOBID=2190987.x3003c0s17b0n0,PBS_QUEUE=rnd,SHELL=/bin/bash,USER=prodops,PBS_JOBCOOKIE=1F3137392E31EED366E50ECE07D09AED,PBS_NODENUM=0,PBS_TASKNUM=1,PBS_MOMPORT=15003,OMP_NUM_THREADS=97,NCPUS=97,PBS_NODEFILE=/var/spool/pbs/aux/2190987.x3003c0s17b0n0,TMPDIR=/var/tmp/pbs.2190987.x3003c0s17b0n0,PBS_JOBDIR=/home/prodops,PBS_ENVIRONMENT=PBS_BATCH,ENVIRONMENT=BATCH,CUDA_VISIBLE_DEVICES=
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;Started, pid = 126109
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU usage: 0.007 secs
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: Memory usage: mem=2796kb
09/05/2024 10:14:50;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;task 00000001 terminated
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;Terminated
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;x3001c0s7b3n0 cput=00:00:00 mem=0kb
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_epilogue, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 256 (elapsed time: 0.6490)
09/05/2024 10:14:51;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:51;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:51;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: No CPU usage data
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 256 (elapsed time: 0.3052)
09/05/2024 10:14:51;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:51;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;Obit sent
09/05/2024 10:14:51;0100;pbs_mom;Req;;Type 14 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:51;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:52;0100;pbs_mom;Req;;Type 6 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:52;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;delete job request received
09/05/2024 10:14:52;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_end, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:52;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 512 (elapsed time: 0.0112)
09/05/2024 10:14:52;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:53;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:53;0100;pbs_mom;Req;;Type 1 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:53;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['cpu/0/cpuid', 'rwm', '*']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidiactl', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-modeset', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm-tools', 'rwm']
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Assigned resources: {'cpuset.cpus': [0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6, 134, 7, 135, 8, 136, 9, 137, 10, 138, 11, 139, 12, 140, 13, 141, 14, 142, 15, 143, 16, 144, 17, 145, 18, 146, 19, 147, 20, 148, 21, 149, 22, 150, 23, 151, 24, 152, 25, 153, 26, 154, 27, 155, 28, 156, 29, 157, 30, 158, 31, 159, 32, 160, 33, 161, 34, 162, 35, 163, 36, 164, 37, 165, 38, 166, 39, 167, 40, 168, 41, 169, 42, 170, 43, 171, 44, 172, 45, 173, 46, 174, 47, 175, 48, 176, 49, 177, 50, 178, 51, 179, 52, 180, 53, 181, 54, 182, 55, 183, 56, 184, 57, 185, 58, 186, 59, 187, 60, 188, 61, 189, 62, 190, 63, 191, 64, 192, 65, 193, 66, 194, 67, 195, 68, 196, 69, 197, 70, 198, 71, 199, 72, 200, 73, 201, 74, 202, 75, 203, 76, 204, 77, 205, 78, 206, 79, 207, 80, 208, 81, 209, 82, 210, 83, 211, 84, 212, 85, 213, 86, 214, 87, 215, 88, 216, 89, 217, 90, 218, 91, 219, 92, 220, 93, 221, 94, 222, 95, 223, 96, 224], 'cpuset.mems': [4, 2, 0, 7, 5, 3, 1, 6], 'mem': 416611827712}
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Key: devices not found in assigned
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU usage: 0.000 secs
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: Memory usage: mem=0b
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190988.x3003c0s17b0n0, event_type 64 (elapsed time: 0.2182)
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 3 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 5 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with suspend
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;suspend failed, job substate = 41
09/05/2024 10:14:54;0080;pbs_mom;Req;req_reject;Reject reply code=15016, aux=0, type=18, from root@172.23.0.4:15001
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with Rerun
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job
09/05/2024 10:14:54;0100;pbs_mom;Job;2190988.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job
09/05/2024 10:14:54;0100;pbs_mom;Job;2190988.x3003c0s17b0n0;x3001c0s7b3n0 cput=00:00:00 mem=0kb
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_launch, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_epilogue, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190988.x3003c0s17b0n0, event_type 2048 (elapsed time: 0.2699)
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_cgroups;env orig: TZ=UTC,PATH=/bin:/usr/bin,PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=prodops,PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc/7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/app/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1.9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/app/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/pygobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binutils/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/opt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,PBS_O_HOST=x3003c0s13b0n0input,HOME=/home/prodops,LOGNAME=prodops,PBS_JOBNAME=stageJob,PBS_JOBID=2190988.x3003c0s17b0n0,PBS_QUEUE=rnd,SHELL=/bin/bash,USER=prodops,PBS_JOBCOOKIE=70D87BA45B2BD3382D97B2524A102E4F,PBS_NODENUM=0,PBS_TASKNUM=1,PBS_MOMPORT=15003,OMP_NUM_THREADS=97,NCPUS=97,PBS_NODEFILE=/var/spool/pbs/aux/2190988.x3003c0s17b0n0,TMPDIR=/var/tmp/pbs.2190988.x3003c0s17b0n0,PBS_JOBDIR=/home/prodops,PBS_ENVIRONMENT=PBS_BATCH,ENVIRONMENT=BATCH

... ...

09/05/2024 10:14:53;0100;pbs_mom;Req;;Type 1 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:53;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['cpu/0/cpuid', 'rwm', '*']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidiactl', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-modeset', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm-tools', 'rwm']
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Assigned resources: {'cpuset.cpus': [0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6, 134, 7, 135, 8, 136, 9, 137, 10, 138, 11, 139, 12, 140, 13, 141, 14, 142, 15, 143, 16, 144, 17, 145, 18, 146, 19, 147, 20, 148, 21, 149, 22, 150, 23, 151, 24, 152, 25, 153, 26, 154, 27, 155, 28, 156, 29, 157, 30, 158, 31, 159, 32, 160, 33, 161, 34, 162, 35, 163, 36, 164, 37, 165, 38, 166, 39, 167, 40, 168, 41, 169, 42, 170, 43, 171, 44, 172, 45, 173, 46, 174, 47, 175, 48, 176, 49, 177, 50, 178, 51, 179, 52, 180, 53, 181, 54, 182, 55, 183, 56, 184, 57, 185, 58, 186, 59, 187, 60, 188, 61, 189, 62, 190, 63, 191, 64, 192, 65, 193, 66, 194, 67, 195, 68, 196, 69, 197, 70, 198, 71, 199, 72, 200, 73, 201, 74, 202, 75, 203, 76, 204, 77, 205, 78, 206, 79, 207, 80, 208, 81, 209, 82, 210, 83, 211, 84, 212, 85, 213, 86, 214, 87, 215, 88, 216, 89, 217, 90, 218, 91, 219, 92, 220, 93, 221, 94, 222, 95, 223, 96, 224], 'cpuset.mems': [4, 2, 0, 7, 5, 3, 1, 6], 'mem': 416611827712}
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Key: devices not found in assigned
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU usage: 0.000 secs
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: Memory usage: mem=0b
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190988.x3003c0s17b0n0, event_type 64 (elapsed time: 0.2182)
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 3 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 5 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with suspend
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;suspend failed, job substate = 41
09/05/2024 10:14:54;0080;pbs_mom;Req;req_reject;Reject reply code=15016, aux=0, type=18, from root@172.23.0.4:15001
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with Rerun
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job
09/05/2024 10:14:54;0100;pbs_mom;Job;2190988.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job