Hi adarsh,
Thanks for your inputs.
- tracejob showed job 2190987 was preempted immediatly after it started to run
- mom log showed “suspend failed”.
- The immediatly followed job 2190988 also got the same problem.
I could make sure no preempt queue was configured. Confused where the preempt signal came from,
PS, sorry not able to share pbsndoes -av, as it’s too big and neither am I able to upload attachment.
Here are appropriate info.
qstat -Bf
Server: x3003c0s17b0n0
server_state = Active
server_host = x3003c0s17b0n0
scheduling = True
total_jobs = 99267
state_count = Transit:0 Queued:48 Held:0 Waiting:0 Running:35 Exiting:0 Beg
un:0
acl_resv_user_enable = True
acl_resv_users = jliu,root,weng,zhangdx
acl_roots = root@*
default_queue = workq
log_events = 511
mail_from = adm
query_other_jobs = True
resources_available.am_finished_job = 2171878.x3003c0s17b0n0 20240831
resources_default.mem = 10mb
resources_default.ncpus = 1
resources_default.queue_charge_rate = 1
resources_default.walltime = 00:05:00
default_chunk.ncpus = 1
resources_assigned.mem = 34792gb
resources_assigned.ncpus = 8902
resources_assigned.nodect = 71
scheduler_iteration = 30
flatuid = True
resv_enable = True
node_fail_requeue = 700
max_array_size = 10000
pbs_license_info = 6200@admin
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:6 Avail_Local:40 Used:158 High_Use:158
pbs_version = 2021.1.3.20220217134230
job_sort_formula = ((log10(ncpus)/10)+(1-(log10(walltime)/10)))/2
eligible_time_enable = False
job_history_enable = True
max_concurrent_provision = 5
max_job_sequence_id = 9999999
jobtrace 2190987
Job: 2190987.x3003c0s17b0n0
09/05/2024 07:26:02 S enqueuing into rnd, state 1 hop 1
09/05/2024 07:26:02 S Job Queued at request of prodops@x3003c0s13b0n0, owner = prodops@x3003c0s13b0n0, job name = stageJob, queue = rnd
09/05/2024 07:26:02 S Job Modified at request of Scheduler@x3003c0s17b0n0
09/05/2024 07:26:02 A user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00
09/05/2024 10:14:47 L Received qrun request
09/05/2024 10:14:50 S 'am_hook' hook set job's Resource_List.am_job_cache = {"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:14:50 S Job Run at request of Scheduler@x3003c0s17b0n0 on exec_vnode (x3001c0s7b3n0:ncpus=97:mem=406847488kb)
09/05/2024 10:14:50 S signal job request received
09/05/2024 10:14:50 S suspend job by Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50 S job signaled with suspend by Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50 S Execution server rejected request
09/05/2024 10:14:50 S preemption method suspend failed for job (15016)
09/05/2024 10:14:50 S Holds s set at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50 S preemption method checkpoint failed for job (15029)
09/05/2024 10:14:50 S Job Rerun at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:14:50 L Job run
09/05/2024 10:14:50 A Resource_List.am_job_cache={"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:14:50 A user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 start=1725531290 exec_host=x3001c0s7b3n0/0*97 exec_vnode=(x3001c0s7b3n0:ncpus=97:mem=406847488kb) Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 resource_assigned.mem=406847488kb resource_assigned.ncpus=97
09/05/2024 10:14:51 S Obit received momhop:1 serverhop:1 state:4 substate:60
09/05/2024 10:14:53 S Job Modified at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:14:53 L Job preempted by requeuing
09/05/2024 10:14:53 A user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 start=1725531290 exec_host=x3001c0s7b3n0/0*97 exec_vnode=(x3001c0s7b3n0:ncpus=97:mem=406847488kb) Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 session=126109 end=1725531293 Exit_status=-11 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=2796kb resources_used.ncpus=97 resources_used.vmem=12316kb resources_used.walltime=00:00:00 run_count=1
09/05/2024 10:18:43 L Received qrun request
09/05/2024 10:18:47 S 'am_hook' hook set job's Resource_List.am_job_cache = {"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:18:47 S Job Run at request of Scheduler@x3003c0s17b0n0 on exec_vnode (x3001c0s7b3n0:ncpus=97:mem=406847488kb)
09/05/2024 10:18:47 S Job Modified at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47 S Execution server rejected request
09/05/2024 10:18:47 S Holds s set at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47 S Job Rerun at request of Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47 L Job run
09/05/2024 10:18:47 A Resource_List.am_job_cache={"job.walltime": 3600, "job.ncpus": 97.0}
09/05/2024 10:18:47 S signal job request received
09/05/2024 10:18:47 S suspend job by Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47 S job signaled with suspend by Scheduler@x3003c0s17b0n0
09/05/2024 10:18:47 S preemption method suspend failed for job (15016)
09/05/2024 10:18:47 S preemption method checkpoint failed for job (15029)
09/05/2024 10:18:47 A user=prodops group=-default- project=prod_ops jobname=stageJob queue=rnd ctime=1725521162 qtime=1725521162 etime=1725521162 start=1725531527 exec_host=x3001c0s7b3n0/0*97 exec_vnode=(x3001c0s7b3n0:ncpus=97:mem=406847488kb) Resource_List.am_job_cache='{"job.walltime": 3600, "job.ncpus": 97.0}' Resource_List.mem=388gb Resource_List.ncpus=97 Resource_List.ngpus=0 Resource_List.nodect=1 Resource_List.place=free Resource_List.queue_charge_rate=1 Resource_List.select=1:ncpus=97:mem=388GB Resource_List.walltime=01:00:00 resource_assigned.mem=406847488kb resource_assigned.ncpus=97
09/05/2024 10:18:48 S Obit received momhop:2 serverhop:2 state:4 substate:60
09/05/2024 10:18:50 L Job preempted by requeuing
mom log
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['cpu/0/cpuid', 'rwm', '*']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidiactl', 'rwm']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-modeset', 'rwm']
09/05/2024 10:14:50;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm-tools', 'rwm']
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0/
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Assigned resources: {'cpuset.cpus': [0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6, 134, 7, 135, 8, 136, 9, 137, 10, 138, 11, 139, 12, 140, 13, 141, 14, 142, 15, 143, 16, 144, 17, 145, 18, 146, 19, 147, 20, 148, 21, 149, 22, 150, 23, 151, 24, 152, 25, 153, 26, 154, 27, 155, 28, 156, 29, 157, 30, 158, 31, 159, 32, 160, 33, 161, 34, 162, 35, 163, 36, 164, 37, 165, 38, 166, 39, 167, 40, 168, 41, 169, 42, 170, 43, 171, 44, 172, 45, 173, 46, 174, 47, 175, 48, 176, 49, 177, 50, 178, 51, 179, 52, 180, 53, 181, 54, 182, 55, 183, 56, 184, 57, 185, 58, 186, 59, 187, 60, 188, 61, 189, 62, 190, 63, 191, 64, 192, 65, 193, 66, 194, 67, 195, 68, 196, 69, 197, 70, 198, 71, 199, 72, 200, 73, 201, 74, 202, 75, 203, 76, 204, 77, 205, 78, 206, 79, 207, 80, 208, 81, 209, 82, 210, 83, 211, 84, 212, 85, 213, 86, 214, 87, 215, 88, 216, 89, 217, 90, 218, 91, 219, 92, 220, 93, 221, 94, 222, 95, 223, 96, 224], 'cpuset.mems': [4, 2, 0, 7, 5, 3, 1, 6], 'mem': 416611827712}
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Key: devices not found in assigned
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU usage: 0.000 secs
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: Memory usage: mem=0b
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 64 (elapsed time: 0.2087)
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 3 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 5 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:50;0004;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job with suspend
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;suspend failed, job substate = 41
09/05/2024 10:14:50;0080;pbs_mom;Req;req_reject;Reject reply code=15016, aux=0, type=18, from root@172.23.0.4:15001
09/05/2024 10:14:50;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:50;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:50;0004;pbs_mom;Job;2190987.x3003c0s17b0n0;signal job with Rerun
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;x3001c0s7b3n0 cput=00:00:00 mem=0kb
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_launch, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_epilogue, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 2048 (elapsed time: 0.2593)
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_cgroups;env orig: TZ=UTC,PATH=/bin:/usr/bin,PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=prodops,PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc/7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/app/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1.9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/app/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/pygobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binutils/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/opt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,PBS_O_HOST=x3003c0s13b0n0input,HOME=/home/prodops,LOGNAME=prodops,PBS_JOBNAME=stageJob,PBS_JOBID=2190987.x3003c0s17b0n0,PBS_QUEUE=rnd,SHELL=/bin/bash,USER=prodops,PBS_JOBCOOKIE=1F3137392E31EED366E50ECE07D09AED,PBS_NODENUM=0,PBS_TASKNUM=1,PBS_MOMPORT=15003,OMP_NUM_THREADS=97,NCPUS=97,PBS_NODEFILE=/var/spool/pbs/aux/2190987.x3003c0s17b0n0,TMPDIR=/var/tmp/pbs.2190987.x3003c0s17b0n0,PBS_JOBDIR=/home/prodops,PBS_ENVIRONMENT=PBS_BATCH,ENVIRONMENT=BATCH
09/05/2024 10:14:50;0100;pbs_python;Hook;pbs_cgroups;env new: TZ=UTC,PATH=/bin:/usr/bin,PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=prodops,PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc/7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/app/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1.9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/app/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/pygobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binutils/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/opt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,PBS_O_HOST=x3003c0s13b0n0input,HOME=/home/prodops,LOGNAME=prodops,PBS_JOBNAME=stageJob,PBS_JOBID=2190987.x3003c0s17b0n0,PBS_QUEUE=rnd,SHELL=/bin/bash,USER=prodops,PBS_JOBCOOKIE=1F3137392E31EED366E50ECE07D09AED,PBS_NODENUM=0,PBS_TASKNUM=1,PBS_MOMPORT=15003,OMP_NUM_THREADS=97,NCPUS=97,PBS_NODEFILE=/var/spool/pbs/aux/2190987.x3003c0s17b0n0,TMPDIR=/var/tmp/pbs.2190987.x3003c0s17b0n0,PBS_JOBDIR=/home/prodops,PBS_ENVIRONMENT=PBS_BATCH,ENVIRONMENT=BATCH,CUDA_VISIBLE_DEVICES=
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;Started, pid = 126109
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU usage: 0.007 secs
09/05/2024 10:14:50;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: Memory usage: mem=2796kb
09/05/2024 10:14:50;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;task 00000001 terminated
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;Terminated
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:50;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:50;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;x3001c0s7b3n0 cput=00:00:00 mem=0kb
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_epilogue, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 256 (elapsed time: 0.6490)
09/05/2024 10:14:51;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:51;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:51;0008;pbs_python;Job;2190987.x3003c0s17b0n0;update_job_usage: No CPU usage data
09/05/2024 10:14:51;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 256 (elapsed time: 0.3052)
09/05/2024 10:14:51;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:51;0100;pbs_mom;Job;2190987.x3003c0s17b0n0;Obit sent
09/05/2024 10:14:51;0100;pbs_mom;Req;;Type 14 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:51;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:52;0100;pbs_mom;Req;;Type 6 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:52;0080;pbs_mom;Job;2190987.x3003c0s17b0n0;delete job request received
09/05/2024 10:14:52;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_end, job ID is 2190987.x3003c0s17b0n0
09/05/2024 10:14:52;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190987.x3003c0s17b0n0, event_type 512 (elapsed time: 0.0112)
09/05/2024 10:14:52;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:53;0008;pbs_mom;Job;2190987.x3003c0s17b0n0;kill_job
09/05/2024 10:14:53;0100;pbs_mom;Req;;Type 1 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:53;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['cpu/0/cpuid', 'rwm', '*']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidiactl', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-modeset', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm-tools', 'rwm']
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Assigned resources: {'cpuset.cpus': [0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6, 134, 7, 135, 8, 136, 9, 137, 10, 138, 11, 139, 12, 140, 13, 141, 14, 142, 15, 143, 16, 144, 17, 145, 18, 146, 19, 147, 20, 148, 21, 149, 22, 150, 23, 151, 24, 152, 25, 153, 26, 154, 27, 155, 28, 156, 29, 157, 30, 158, 31, 159, 32, 160, 33, 161, 34, 162, 35, 163, 36, 164, 37, 165, 38, 166, 39, 167, 40, 168, 41, 169, 42, 170, 43, 171, 44, 172, 45, 173, 46, 174, 47, 175, 48, 176, 49, 177, 50, 178, 51, 179, 52, 180, 53, 181, 54, 182, 55, 183, 56, 184, 57, 185, 58, 186, 59, 187, 60, 188, 61, 189, 62, 190, 63, 191, 64, 192, 65, 193, 66, 194, 67, 195, 68, 196, 69, 197, 70, 198, 71, 199, 72, 200, 73, 201, 74, 202, 75, 203, 76, 204, 77, 205, 78, 206, 79, 207, 80, 208, 81, 209, 82, 210, 83, 211, 84, 212, 85, 213, 86, 214, 87, 215, 88, 216, 89, 217, 90, 218, 91, 219, 92, 220, 93, 221, 94, 222, 95, 223, 96, 224], 'cpuset.mems': [4, 2, 0, 7, 5, 3, 1, 6], 'mem': 416611827712}
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Key: devices not found in assigned
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU usage: 0.000 secs
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: Memory usage: mem=0b
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190988.x3003c0s17b0n0, event_type 64 (elapsed time: 0.2182)
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 3 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 5 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with suspend
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;suspend failed, job substate = 41
09/05/2024 10:14:54;0080;pbs_mom;Req;req_reject;Reject reply code=15016, aux=0, type=18, from root@172.23.0.4:15001
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with Rerun
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job
09/05/2024 10:14:54;0100;pbs_mom;Job;2190988.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job
09/05/2024 10:14:54;0100;pbs_mom;Job;2190988.x3003c0s17b0n0;x3001c0s7b3n0 cput=00:00:00 mem=0kb
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_launch, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_epilogue, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190988.x3003c0s17b0n0, event_type 2048 (elapsed time: 0.2699)
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_cgroups;env orig: TZ=UTC,PATH=/bin:/usr/bin,PBS_O_HOME=/home/prodops,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=prodops,PBS_O_PATH=/home/prodops/cylc-run/mss-aa999:/ccrspfs1/project/app/cylc/7.9.6/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/home/prodops/cylc-run/mss-aa999/bin:/ccrspfs1/project/app/rose/2019.01.5/bin:.:/home/prodops/bin:/scratch/rhuva/anaconda3/bin:/scratch/rhuva/anaconda3/condabin:.:/home/prodops/bin:/project/app/cdo/1.9.5/bin:/project/app/eccodes/2.23.0/cray/13.0.2/bin:/project/app/wgrib2/3.1.1/bin:/project/app/ncl/6.6.2/bin:/project/app/fcm/2021.05.0/bin:/project/app/cylc/7.9.6/bin:/project/app/rose/2019.01.5/sbin:/project/app/rose/2019.01.5/bin:/project/app/pygtk/2.24.0-ucs4/bin:/project/app/pygobject/2.21.5-ucs4/bin:/project/app/python/2.7.18-ucs4/bin:/opt/cray/pe/pals/1.1.6/bin:/opt/cray/pe/craype/2.7.15/bin:/opt/cray/pe/cce/13.0.2/binutils/x86_64/x86_64-pc-linux-gnu/bin:/opt/cray/pe/cce/13.0.2/binutils/cross/x86_64-aarch64/bin:/opt/cray/pe/cce/13.0.2/utils/x86_64/bin:/opt/cray/pe/cce/13.0.2/bin:/opt/cray/pe/perftools/22.04.0/bin:/opt/cray/pe/papi/6.0.0.14/bin:/opt/cray/libfabric/1.11.0.4.125/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/home/prodops/bin:/usr/local/bin:/usr/bin:/bin:/opt/c3/bin:/opt/pbs/bin:/sbin:/bin:/opt/cray/pe/bin,PBS_O_MAIL=/var/spool/mail/prodops,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=rnd,PBS_O_HOST=x3003c0s13b0n0input,HOME=/home/prodops,LOGNAME=prodops,PBS_JOBNAME=stageJob,PBS_JOBID=2190988.x3003c0s17b0n0,PBS_QUEUE=rnd,SHELL=/bin/bash,USER=prodops,PBS_JOBCOOKIE=70D87BA45B2BD3382D97B2524A102E4F,PBS_NODENUM=0,PBS_TASKNUM=1,PBS_MOMPORT=15003,OMP_NUM_THREADS=97,NCPUS=97,PBS_NODEFILE=/var/spool/pbs/aux/2190988.x3003c0s17b0n0,TMPDIR=/var/tmp/pbs.2190988.x3003c0s17b0n0,PBS_JOBDIR=/home/prodops,PBS_ENVIRONMENT=PBS_BATCH,ENVIRONMENT=BATCH
... ...
09/05/2024 10:14:53;0100;pbs_mom;Req;;Type 1 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:53;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 2190988.x3003c0s17b0n0
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['cpu/0/cpuid', 'rwm', '*']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidiactl', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-modeset', 'rwm']
09/05/2024 10:14:54;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm-tools', 'rwm']
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/2190988.x3003c0s17b0n0/
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Assigned resources: {'cpuset.cpus': [0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6, 134, 7, 135, 8, 136, 9, 137, 10, 138, 11, 139, 12, 140, 13, 141, 14, 142, 15, 143, 16, 144, 17, 145, 18, 146, 19, 147, 20, 148, 21, 149, 22, 150, 23, 151, 24, 152, 25, 153, 26, 154, 27, 155, 28, 156, 29, 157, 30, 158, 31, 159, 32, 160, 33, 161, 34, 162, 35, 163, 36, 164, 37, 165, 38, 166, 39, 167, 40, 168, 41, 169, 42, 170, 43, 171, 44, 172, 45, 173, 46, 174, 47, 175, 48, 176, 49, 177, 50, 178, 51, 179, 52, 180, 53, 181, 54, 182, 55, 183, 56, 184, 57, 185, 58, 186, 59, 187, 60, 188, 61, 189, 62, 190, 63, 191, 64, 192, 65, 193, 66, 194, 67, 195, 68, 196, 69, 197, 70, 198, 71, 199, 72, 200, 73, 201, 74, 202, 75, 203, 76, 204, 77, 205, 78, 206, 79, 207, 80, 208, 81, 209, 82, 210, 83, 211, 84, 212, 85, 213, 86, 214, 87, 215, 88, 216, 89, 217, 90, 218, 91, 219, 92, 220, 93, 221, 94, 222, 95, 223, 96, 224], 'cpuset.mems': [4, 2, 0, 7, 5, 3, 1, 6], 'mem': 416611827712}
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Key: devices not found in assigned
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU percent: 0
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: CPU usage: 0.000 secs
09/05/2024 10:14:54;0008;pbs_python;Job;2190988.x3003c0s17b0n0;update_job_usage: Memory usage: mem=0b
09/05/2024 10:14:54;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 2190988.x3003c0s17b0n0, event_type 64 (elapsed time: 0.2182)
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 3 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 5 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;no active tasks
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with suspend
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;suspend failed, job substate = 41
09/05/2024 10:14:54;0080;pbs_mom;Req;req_reject;Reject reply code=15016, aux=0, type=18, from root@172.23.0.4:15001
09/05/2024 10:14:54;0100;pbs_mom;Req;;Type 18 request received from root@172.23.0.4:15001, sock=1
09/05/2024 10:14:54;0080;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job request received
09/05/2024 10:14:54;0004;pbs_mom;Job;2190988.x3003c0s17b0n0;signal job with Rerun
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job
09/05/2024 10:14:54;0100;pbs_mom;Job;2190988.x3003c0s17b0n0;task 00000001 cput=00:00:00
09/05/2024 10:14:54;0008;pbs_mom;Job;2190988.x3003c0s17b0n0;kill_job