We have a PBS cluster.
The PBS server is running RHEL 7.9.
The PBS compute/client server is running RHEL 8.8.
The PBS version is the same on both servers.
Result files are not copied back to $PBS_O_WORKDIR, and the job stays in the running state even after producing its output files.
Is there an error in the log file indicating a file processing error? I don’t know why the job hasn’t exited. PBS will send an email after completion if there was a file processing error, but since the job doesn’t complete, it might not be sending the email. (Check the mom logs.)
When I restart PBS on the client node while the job is in the running state, the job copies its files back to the working directory and finishes by itself.
Why does it behave this way? Is there any config I’m missing?
But… have you tried a simple echo to standard out from your job script and nothing else? Or a different script or code from the one you are executing? Does a hello world get hung up?
You might want to increase the log level as well,
on the execution node:
in $PBS_HOME/mom_priv/config
$logevent 0xffffffff
# $logevent 255
and HUP pbs_mom (send it SIGHUP so it rereads the config).
This will increase your log detail significantly (when done, go back to 255).
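To get a feel for how much wider the temporary mask is than the usual 255, the two values can be compared bit by bit. This is only a sketch in Python: it assumes $logevent is treated as a bitmask and deliberately does not name what each bit means (check the PBS reference guide for that). The helper name `set_bits` is made up for this example.

```python
# Compare the two $logevent values from the mom config above.
DEFAULT_MASK = 255          # the value to go back to when done
VERBOSE_MASK = 0xffffffff   # the temporary debugging value

def set_bits(mask):
    """Return the positions of the bits set in mask (bit 0 is the lowest)."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

# Bits that are on in the verbose mask but off in the default one.
extra = VERBOSE_MASK & ~DEFAULT_MASK

print(f"default mask: {DEFAULT_MASK:#010x}, bits {set_bits(DEFAULT_MASK)}")
print(f"extra bits  : {extra:#010x}, {len(set_bits(extra))} more bits enabled")
```

So everything above the low eight bits gets switched on, which is why the log volume jumps.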
There should be more information there about your job as well. My “canstat” is 0, so I am wondering if your canstat indicates a permissions problem, even with reading the script you are trying to execute.
Maybe try a simple hello world script submitted to PBS as a test.
Do you see the job number in there, or the obit if it completed, or a copy request at the end?
After this, nothing is recorded in the mom logs until I restart PBS; then the logs below appear:
11/15/2023 12:34:56;0002;pbs_mom;Svr;pbs_mom;caught signal 15
11/15/2023 12:34:56;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Shutting down TPP transport Layer
11/15/2023 12:34:56;0d80;pbs_mom;TPP;pbs_mom(Thread 0);Thrd exiting, had 1 connections
11/15/2023 12:34:56;0002;pbs_mom;Svr;pbs_mom;Is down
11/15/2023 12:34:56;0002;pbs_mom;Svr;Log;Log closed
11/15/2023 12:34:57;0002;pbs_mom;Svr;Log;Log opened
11/15/2023 12:34:57;0002;pbs_mom;Svr;pbs_mom;pbs_version=2020.1.4.20210506140333
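If the log gets noisy at the higher log level, a small filter can help pick out the lines for one job or one subsystem. Below is a rough sketch in Python that splits lines shaped like the ones above (six semicolon-separated fields). The field names in `MomLogRecord` and the helper names are my own labels, not anything PBS defines, and treating the second field as hex is an assumption based on how it looks in these lines.

```python
from typing import NamedTuple

class MomLogRecord(NamedTuple):
    timestamp: str
    event: int      # second field, parsed as a hex number (assumption)
    daemon: str
    obj_type: str
    obj_name: str
    message: str

def parse_mom_line(line):
    """Split one semicolon-delimited log line; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)  # message may itself contain ';'
    if len(parts) != 6:
        return None
    ts, code, daemon, obj_type, obj_name, msg = parts
    try:
        event = int(code, 16)
    except ValueError:
        return None
    return MomLogRecord(ts, event, daemon, obj_type, obj_name, msg)

def grep_records(lines, needle):
    """Yield parsed records whose object name or message contains needle."""
    for line in lines:
        rec = parse_mom_line(line)
        if rec and (needle in rec.obj_name or needle in rec.message):
            yield rec

# Sample lines copied from the log excerpt above.
SAMPLE = [
    "11/15/2023 12:34:56;0002;pbs_mom;Svr;pbs_mom;caught signal 15",
    "11/15/2023 12:34:56;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Shutting down TPP transport Layer",
    "11/15/2023 12:34:57;0002;pbs_mom;Svr;pbs_mom;pbs_version=2020.1.4.20210506140333",
]

for rec in grep_records(SAMPLE, "TPP"):
    print(rec.timestamp, hex(rec.event), rec.message)
```

To use it on a real log, pass the lines of the mom log file instead of `SAMPLE` and the job ID instead of "TPP"; that should show whether an obit or a copy request was ever logged for the job.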
Got me. So the echo is working; what about a simple Python MPI hello world?
This is Python with mpi4py.
#!/usr/bin/env python
import sys
from mpi4py import MPI
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
print(f"I am process rank {rank}, So {rank+1} of {size}. This process running on {name}")
This is simple standard out; if echo is working I imagine this would work too, though you should try a couple of different chunk counts (-lselect=1, 2, 3, and -lplace=scatter across a couple of nodes). Maybe start here, add a little complexity, and start writing files directly.
edit: this requires that mpiprocs be specified as part of the select.
Otherwise, I am wondering if it is your code hanging up or something.
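For the "start writing files directly" step, something like the following could be the next test after the hello world. It is only a sketch: the file name `hello_rank<N>.txt` and the function `write_marker` are made up for this example, and it assumes $PBS_O_WORKDIR is visible from the execution node (for example on a shared filesystem). If it is not, pass a different `outdir`. It falls back to a single process when mpi4py is not installed.

```python
#!/usr/bin/env python
import os
import socket

try:
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
    size = MPI.COMM_WORLD.Get_size()
except ImportError:
    # No mpi4py available: behave like a single-process job.
    rank, size = 0, 1

def write_marker(rank, size, outdir=None):
    """Write one small per-rank file and return its path."""
    # Default to the submission directory, or the current directory outside PBS.
    outdir = outdir or os.environ.get("PBS_O_WORKDIR", os.getcwd())
    path = os.path.join(outdir, f"hello_rank{rank}.txt")  # hypothetical file name
    with open(path, "w") as fh:
        fh.write(f"rank {rank} of {size} on {socket.gethostname()}\n")
    return path

if __name__ == "__main__":
    print("wrote", write_marker(rank, size))
```

If this small job also sits in the running state after the files appear, the problem is probably not in your own code.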