Jobs stuck in R status after power failure

Hi!

I’ve been facing the following issue on a cluster running OpenPBS 23.06.06: when a power failure reboots the head node and the compute nodes, the jobs that were running come back stuck in R state in the queue, with Time Use = 0 and no processes running on the node. The nodes use a stateless image.

I tried increasing node_fail_requeue, but without success. The only messages in the log about the jobs stuck in this state are:

06/23/2024 11:27:50;0100;Server@cluster;Job;5118.cluster;enqueuing into cpu, state R hop 1
06/23/2024 11:27:50;0086;Server@cluster;Job;5118.cluster;Requeueing job, substate: 42 Requeued in queue: cpu
06/23/2024 11:27:50;0100;Server@cluster;Node;n02;set_vnode_state;vnode.state=0x102 vnode_o.state=0x102 vnode.last_state_change_time=1719152870 vnode_o.last_state_change_time=1718448592 state_bits=0xffffffffffffffaf state_bit_op_type_str=Nd_State_And state_bit_op_type_enum=2
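
For anyone following along, here is a minimal shell sketch for pulling the records for a single job (or node) out of logs like the ones above. It relies on PBS log records being semicolon-separated (timestamp, event code, daemon, object type, object name, message). The function name job_events is made up for illustration and is not a PBS tool.

```shell
# job_events OBJECT [LOGFILE...]
# Print timestamp, event code, and message for every record whose
# object-name field (5th field) equals OBJECT, e.g. a job ID or a node name.
# Reads standard input when no log file is given.
job_events() {
    job_id="$1"
    shift
    awk -F';' -v id="$job_id" '
        $5 == id {
            # The message itself may contain semicolons, so rejoin fields 6..NF.
            msg = $6
            for (i = 7; i <= NF; i++) msg = msg ";" $i
            print $1 " [" $2 "] " msg
        }' "$@"
}
```

A typical call might be `job_events 5118.cluster /var/spool/pbs/server_logs/20240623`, assuming the default PBS_HOME and date-named log files; adjust the path to the actual installation.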

Any comments or suggestions are appreciated.

Thank you!
Nícolas

  1. When the node reboots, does it get a new image? Does the node still have the contents of the job spool directory (/var/spool/pbs/spool/) for the jobs that previously ran on it?
  2. Does the node accept freshly submitted jobs if resources are available on it?
  3. If you could share the complete server, scheduler, and mom logs from the day of the incident, that would be helpful, and community members can check them and give feedback.

Hi Adarsh!

Thank you for your response!

  1. When the nodes reboot, they receive a new image. This means that the contents of /var/spool/pbs/ from before the reboot are completely lost.
  2. If a new job is sent to a node that has jobs stuck in R, and resources are available, the new job runs without issues.
  3. Below are the server logs, scheduler logs, and mom logs from the day of the incident.

Server log: server_logs - Pastebin.com
Scheduler log: sched_logs - Pastebin.com
Mom log: mom_logs - Pastebin.com

Note: The machines were restarted at approximately 11:27, and since the nodes receive a new image, their mom logs only contain entries from the moment they were rebooted.

Thanks!
Nícolas

Thank you Nicolas for the above logs and description, much appreciated.

It seems mom_logs has no entry for job ID 5118; the mom_logs appear to be wiped when the nodes reboot. This needs to be caught while it is happening, which is difficult because the system has to be rebooted. Would it be possible to symlink the mom_logs folder to a shared drive? That way we at least keep the logs when the system boots.
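
A rough sketch of that symlink setup, under some assumptions: the function name link_mom_logs and the shared mount point are hypothetical, and a per-host subdirectory is used so that nodes do not write into each other's files. It would need to run on the node before pbs_mom starts.

```shell
# link_mom_logs PBS_HOME SHARED_ROOT [HOST]
# Point PBS_HOME/mom_logs at SHARED_ROOT/HOST so the logs survive a reimage.
link_mom_logs() {
    pbs_home="$1"
    shared_root="$2"
    host="${3:-$(uname -n)}"          # defaults to this machine's node name
    target="$shared_root/$host"

    mkdir -p "$target" || return 1
    # Keep any existing local logs instead of deleting them.
    if [ -d "$pbs_home/mom_logs" ] && [ ! -L "$pbs_home/mom_logs" ]; then
        mv "$pbs_home/mom_logs" "$pbs_home/mom_logs.local" || return 1
    fi
    # -n replaces an existing symlink rather than descending into it.
    ln -sfn "$target" "$pbs_home/mom_logs"
}
```

For example, `link_mom_logs /var/spool/pbs /shared/pbs_mom_logs`, where /shared/pbs_mom_logs stands in for whatever shared filesystem the nodes mount.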

Hi Adarsh!

Thank you for your response. I ran the test on another node with the same characteristics, with mom_logs set up as a symbolic link to a shared directory, as suggested. The test job has job ID 6040. Below is the log stored in the shared directory:

06/28/2024 12:08:20;0002;pbs_mom;Svr;Log;Log opened
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;pbs_version=23.06.06
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;hostname=n05;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: localhost4.localdomain4 
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;ipv4 interface eno8303: n05.cluster.br 
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;ipv4 interface ib0: n05-ib0.cluster.br 
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: localhost6.localdomain6 
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;ipv6 interface eno8303: n05 
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;ipv6 interface ib0: n05 
06/28/2024 12:08:20;0100;pbs_mom;Svr;parse_config;file config
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.255.254 as authorized
06/28/2024 12:08:20;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
06/28/2024 12:08:20;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.5 as authorized
06/28/2024 12:08:20;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
06/28/2024 12:08:20;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
06/28/2024 12:08:20;0086;pbs_mom;Svr;pbs_mom;Found hook pbs_cgroups type=site
06/28/2024 12:08:20;0086;pbs_mom;Svr;pbs_mom;Found hook PBS_power type=pbs
06/28/2024 12:08:20;0086;pbs_mom;Svr;pbs_mom;Found hook PBS_cray_atom type=pbs
06/28/2024 12:08:20;0086;pbs_mom;Svr;pbs_mom;Found hook PBS_alps_inventory_check type=pbs
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(periodic,execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[2] = {PBS_cray_atom, order=100, type=1, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_end), alarm=300, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[3] = {PBS_alps_inventory_check, order=1, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(exechost_periodic), alarm=90, freq=300}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_begin hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_begin hook[1] = {PBS_cray_atom, order=100, type=1, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_end), alarm=300, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_begin hook[2] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(periodic,execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_prologue hook[0] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(periodic,execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_launch hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_epilogue hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_epilogue hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(periodic,execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_end hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_end hook[1] = {PBS_cray_atom, order=100, type=1, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_end), alarm=300, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_end hook[2] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(periodic,execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;exechost_periodic hook[0] = {PBS_alps_inventory_check, order=1, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(exechost_periodic), alarm=90, freq=300}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;exechost_periodic hook[1] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;exechost_periodic hook[2] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(periodic,execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;exechost_startup hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;exechost_startup hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(periodic,execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_attach hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_resize hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_abort hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_postsuspend hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0080;pbs_mom;Hook;print_hook;execjob_preresume hook[0] = {pbs_cgroups, order=100, type=0, enabled=0 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,execjob_resize,execjob_abort,execjob_postsuspend,execjob_preresume,exechost_periodic,exechost_startup), alarm=90, freq=120}
06/28/2024 12:08:20;0001;pbs_mom;Svr;pbs_mom;proc_get_btime, fscanf failed. ERR : Inappropriate ioctl for device
06/28/2024 12:08:20;0002;pbs_mom;n/a;ncpus;hyperthreading disabled
06/28/2024 12:08:20;0002;pbs_mom;n/a;initialize;pcpus=128, OS reports 128 cpu(s)
06/28/2024 12:08:20;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP authentication method = resvport
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Main Thread);TPP leaf node names = 172.26.0.5:15003,127.0.0.1:15003,172.26.0.5:15003,172.27.0.5:15003
06/28/2024 12:08:20;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
06/28/2024 12:08:20;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 16384
06/28/2024 12:08:20;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
06/28/2024 12:08:20;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm cluster:17001
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Thread ready
06/28/2024 12:08:20;0006;pbs_mom;Fil;pbs_mom;Version 23.06.06, started, initialization type = 0
06/28/2024 12:08:20;0002;pbs_mom;Svr;pbs_mom;Mom pid = 51127 ready, using ports Server:15001 MOM:15002 RM:15003
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 172.26.0.5:15003 to pbs_comm cluster:17001
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 172.27.0.5:15003 to pbs_comm cluster:17001
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm cluster:17001
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:08:20;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:08:20;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at cluster:15001, stream:0
06/28/2024 12:08:22;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:08:22;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:08:22;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.1 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.2 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.3 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.4 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.1.1 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.1.2 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.2.1 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.2.2 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.2.3 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.2.4 as authorized
06/28/2024 12:08:22;0002;pbs_mom;Svr;pbs_mom;ReplyHello from server at 172.26.255.254:15001
06/28/2024 12:08:22;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:08:22;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:14:15;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:14:15;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:14:15;0100;pbs_mom;Req;;Type 1 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:14:15;0100;pbs_mom;Req;;Type 3 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:14:15;0100;pbs_mom;Req;;Type 5 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:14:15;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:14:15;0008;pbs_mom;Job;6040.cluster;Started, pid = 51186
06/28/2024 12:14:17;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:14:27;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:14:43;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:15:05;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:15:33;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:16:07;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:16:14;0002;pbs_mom;Svr;pbs_mom;caught signal 15
06/28/2024 12:16:14;0008;pbs_mom;Job;6040.cluster;kill_job
06/28/2024 12:16:14;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Shutting down TPP transport Layer
06/28/2024 12:16:14;0d80;pbs_mom;TPP;pbs_mom(Thread 0);Thrd exiting, had 1 connections
06/28/2024 12:16:14;0002;pbs_mom;Svr;pbs_mom;Is down
06/28/2024 12:16:14;0002;pbs_mom;Svr;Log;Log closed
06/28/2024 12:20:27;0002;pbs_mom;Svr;Log;Log opened
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;pbs_version=23.06.06
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;hostname=n05;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: localhost4.localdomain4 
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface eno8303: n05.cluster.br 
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface ib0: n05-ib0.cluster.br 
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: localhost6.localdomain6 
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;ipv6 interface eno8303: n05 
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;ipv6 interface ib0: n05 
06/28/2024 12:20:27;0100;pbs_mom;Svr;parse_config;file config
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.255.254 as authorized
06/28/2024 12:20:27;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
06/28/2024 12:20:27;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
06/28/2024 12:20:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.5 as authorized
06/28/2024 12:20:27;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
06/28/2024 12:20:27;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
06/28/2024 12:20:27;0001;pbs_mom;Svr;pbs_mom;proc_get_btime, fscanf failed. ERR : Inappropriate ioctl for device
06/28/2024 12:20:27;0002;pbs_mom;n/a;ncpus;hyperthreading disabled
06/28/2024 12:20:28;0002;pbs_mom;n/a;initialize;pcpus=128, OS reports 128 cpu(s)
06/28/2024 12:20:28;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP authentication method = resvport
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Main Thread);TPP leaf node names = 172.26.0.5:15003,127.0.0.1:15003,172.26.0.5:15003,172.27.0.5:15003
06/28/2024 12:20:28;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
06/28/2024 12:20:28;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 16384
06/28/2024 12:20:28;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
06/28/2024 12:20:28;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm cluster:17001
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Thread ready
06/28/2024 12:20:28;0006;pbs_mom;Fil;pbs_mom;Version 23.06.06, started, initialization type = 0
06/28/2024 12:20:28;0002;pbs_mom;Svr;pbs_mom;Mom pid = 4872 ready, using ports Server:15001 MOM:15002 RM:15003
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 172.26.0.5:15003 to pbs_comm cluster:17001
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 172.27.0.5:15003 to pbs_comm cluster:17001
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm cluster:17001
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:28;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:28;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at cluster:15001, stream:0
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.1 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.2 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.3 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.0.4 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.1.1 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.1.2 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.2.1 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.2.2 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.2.3 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.26.2.4 as authorized
06/28/2024 12:20:30;0002;pbs_mom;Svr;pbs_mom;ReplyHello from server at 172.26.255.254:15001
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;resourcedef;copy hook-related file request received
06/28/2024 12:20:30;0d80;pbs_mom;TPP;pbs_mom(Thread 0);handle_incoming_data;Increased scratch size for tfd=13 to 16384
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;PBS_alps_inventory_check.HK;copy hook-related file request received
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;PBS_alps_inventory_check.PY;copy hook-related file request received
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;PBS_cray_atom.HK;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;PBS_cray_atom.CF;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;PBS_cray_atom.PY;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;PBS_power.HK;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;PBS_power.CF;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;PBS_power.PY;copy hook-related file request received
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;pbs_cgroups.HK;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;pbs_cgroups.CF;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;pbs_cgroups.PY;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;pbs_cgroups.PY;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;pbs_cgroups.PY;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;pbs_cgroups.PY;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
06/28/2024 12:20:30;0100;pbs_mom;Req;;Type 85 request received from root@172.26.255.254:15001, sock=0
06/28/2024 12:20:30;0080;pbs_mom;Hook;pbs_cgroups.PY;copy hook-related file request received
06/28/2024 12:20:30;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box

Thanks,
Nícolas

Thank you Nicolas. As you can see, the pbs_mom process caught signal 15 (SIGTERM).
I am not sure what caused it to shut down gracefully.

The next step is to run strace on pbs_mom, then submit a job and check:

  1. boot up the node
  2. strace -p <pbs_mom PID> -tt -o /path/to/shared/folder/pbs_mom_strace.txt -ff -s 4096
  3. submit a job
  4. the strace output might then give us some hints

06/28/2024 12:16:14;0002;pbs_mom;Svr;pbs_mom;caught signal 15
06/28/2024 12:16:14;0008;pbs_mom;Job;6040.cluster;kill_job
06/28/2024 12:16:14;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Shutting down TPP transport Layer
06/28/2024 12:16:14;0d80;pbs_mom;TPP;pbs_mom(Thread 0);Thrd exiting, had 1 connections
06/28/2024 12:16:14;0002;pbs_mom;Svr;pbs_mom;Is down
06/28/2024 12:16:14;0002;pbs_mom;Svr;Log;Log closed

Hi Adarsh!

Thank you for your response. Apologies: signal 15 (SIGTERM) was sent because I deliberately rebooted the node to check whether the logs would keep being written to the same file, with mom_logs symlinked to the shared directory.

In a new test without any interruption such as a shutdown, the job completes successfully.

Thank you,
Nícolas

Hi Adarsh!

Is it possible that this issue occurs because every time the system is rebooted, the entire /var/spool/pbs directory from before the reboot is lost? This includes losing the spool and mom_priv directories, which contain information about the jobs that were running.

Would a stateful image or mounting /var/spool/pbs on a local disk perhaps solve the issue?

Thank you!
Nícolas

Making /var/spool/pbs persistent would be helpful. Setting the variable below in /etc/pbs.conf on each compute node might also help, since it lets you point pbs_mom at a location specific to that node.

PBS_MOM_HOME=/var/spool/pbs_cnode1

Ref: AG-40, PBS Professional 2022.1 Administrator’s Guide
If PBS_MOM_HOME is present in the pbs.conf file, pbs_mom will use that directory for its “home” instead of PBS_HOME.
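
As an illustration only, the setting could be applied with a small helper like the one below. The function name set_mom_home is hypothetical. Whether pbs_mom populates a fresh directory on first start, or whether the existing PBS_HOME contents need to be copied over first, is not covered in this thread, so check that before relying on it.

```shell
# set_mom_home CONF DIR
# Set PBS_MOM_HOME=DIR in the pbs.conf file CONF, replacing any earlier entry
# and leaving all other settings untouched.
set_mom_home() {
    conf="$1"
    dir="$2"
    tmp=$(mktemp) || return 1
    # Copy everything except an existing PBS_MOM_HOME line
    # (grep exits non-zero when nothing is left, which is fine here).
    grep -v '^PBS_MOM_HOME=' "$conf" > "$tmp" || true
    printf 'PBS_MOM_HOME=%s\n' "$dir" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

For instance, `set_mom_home /etc/pbs.conf "/var/spool/pbs_$(uname -n)"` would give each node its own directory name, assuming that path sits on storage that survives a reboot; PBS on the node then needs a restart to pick up the change.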

Hi Adarsh!

Thank you for your response! Setting PBS_MOM_HOME to a persistent directory seems to have resolved the issue. Now, after the machines reboot, the jobs restart successfully and no longer get stuck in the queue as before.

Thank you very much!
Nícolas
