Output files are not copying back to PBS_O_WORKDIR

Hello Team,

We have a PBS cluster.
The PBS server is running RHEL 7.9.
The PBS compute/client node is running RHEL 8.8.
The PBS version is the same on both servers.
Result files are not copying back to $PBS_O_WORKDIR, and the job stays in the running state even after producing its output files.

PBS version: pbs_version = 2020.1.4.20210506140333

Can someone help with this?

Is there an error in the log file indicating a file processing error? I don’t know why the job hasn’t exited. PBS will send an email *after* completion if there was a file processing error, but since the job never completes, it might not be sending that email. (Check the mom logs.)

When I restart PBS on the client node while the job is in the running state, the output copies back to the working directory and the job finishes by itself.
Why does this happen? Is there any config I’m missing?

I am not sure why.

But… have you tried a simple echo to standard out from your job script and nothing else? Is it a different script or code you are executing? Does a hello world get hung up?

You might want to increase the log level as well.

On the execution node, in $PBS_HOME/mom_priv/config:

$logevent 0xffffffff
# $logevent 255

and then HUP pbs_mom so it re-reads the config.

This will bump up your log detail significantly (when you’re done, go back to 255).
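For reference, a rough sketch of doing that on the execution node (assuming the default $PBS_HOME of /var/spool/pbs and root access; pkill just sends SIGHUP to the running pbs_mom):

grep logevent /var/spool/pbs/mom_priv/config
pkill -HUP -x pbs_mom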

Yes, a simple echo to standard output works.
Yes, I’m using a different script to execute the PBS job.
I increased the log level.

These are the logs:

11/10/2023 12:18:09;0008;pbs_mom;Job;7890.skynet;Started, pid = 277818
11/10/2023 12:18:19;0800;pbs_mom;n/a;mom_get_sample;nprocs: 613, cantstat: 2, nomem: 0, skipped: 533, cached: 0
11/10/2023 12:18:36;0800;pbs_mom;n/a;mom_get_sample;nprocs: 613, cantstat: 2, nomem: 0, skipped: 533, cached: 0
11/10/2023 12:18:59;0800;pbs_mom;n/a;mom_get_sample;nprocs: 614, cantstat: 2, nomem: 0, skipped: 534, cached: 0
11/10/2023 12:19:28;0800;pbs_mom;n/a;mom_get_sample;nprocs: 614, cantstat: 2, nomem: 0, skipped: 534, cached: 0
11/10/2023 12:19:36;0004;pbs_mom;Svr;pbs_mom;rpp_retry changed from 10 to 30
11/10/2023 12:20:02;0800;pbs_mom;n/a;mom_get_sample;nprocs: 616, cantstat: 2, nomem: 0, skipped: 536, cached: 0
11/10/2023 12:20:42;0800;pbs_mom;n/a;mom_get_sample;nprocs: 615, cantstat: 2, nomem: 0, skipped: 535, cached: 0
11/10/2023 12:21:28;0800;pbs_mom;n/a;mom_get_sample;nprocs: 617, cantstat: 2, nomem: 0, skipped: 537, cached: 0
11/10/2023 12:22:20;0800;pbs_mom;n/a;mom_get_sample;nprocs: 611, cantstat: 2, nomem: 0, skipped: 531, cached: 0
11/10/2023 12:23:19;0800;pbs_mom;n/a;mom_get_sample;nprocs: 612, cantstat: 2, nomem: 0, skipped: 532, cached: 0

There should be more information in there around your job as well. My “cantstat” is 0, so I am wondering if your cantstat indicates a permissions problem even reading the script you are trying to execute.

Maybe try a simple hello world script submitted to PBS as a test.
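Something along these lines, as a minimal sketch (the job name, resource request, and file names are just placeholders):

#!/bin/bash
#PBS -N hello_test
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:05:00
#PBS -j oe
# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
echo "hello world from $(hostname)"

Submit it with qsub hello.sh and check that hello_test.o<jobid> lands back in the submission directory.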

Do you see the job number in there, or the obit if it completed, or a copy file request at the end?


11/10/2023 09:22:57;0400;pbs_mom;Job;is_request;received ack obits = 1

...

11/10/2023 09:22:57;0080;pbs_mom;Job;558.hpc-XXXX;copy file request received

Also, information about the job and the resources assigned should be in there as well.
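If it helps, you can pull every line logged for a given job id out of the day's mom log, or run tracejob on the server host (the job id and date below are just the ones from your earlier snippet; /var/spool/pbs assumes the default $PBS_HOME):

grep 7890.skynet /var/spool/pbs/mom_logs/20231110
tracejob -n 7 7890.skynet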

A simple echo-and-sleep script executes successfully, and its output/error files are copied back to the workdir.

After I submit the job for the script in question, this is the only line recorded in the mom log file:

11/15/2023 13:54:54;0008;pbs_mom;Job;7942.skynet;Started, pid = 12755

After this, nothing is recorded in the mom logs until I restart PBS; then the logs below appear:

11/15/2023 12:34:56;0002;pbs_mom;Svr;pbs_mom;caught signal 15
11/15/2023 12:34:56;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Shutting down TPP transport Layer
11/15/2023 12:34:56;0d80;pbs_mom;TPP;pbs_mom(Thread 0);Thrd exiting, had 1 connections
11/15/2023 12:34:56;0002;pbs_mom;Svr;pbs_mom;Is down
11/15/2023 12:34:56;0002;pbs_mom;Svr;Log;Log closed
11/15/2023 12:34:57;0002;pbs_mom;Svr;Log;Log opened
11/15/2023 12:34:57;0002;pbs_mom;Svr;pbs_mom;pbs_version=2020.1.4.20210506140333

Got me. The echo is working; what about a simple Python MPI hello world?

This is Python with mpi4py:

#!/usr/bin/env python

import sys
from mpi4py import MPI

# query the world communicator for this process's rank and the total number of ranks
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

print(f"I am process rank {rank}, so {rank+1} of {size}. This process is running on {name}")

This is simple standard out; if echo is working, I imagine this would work too, though you should try a couple of different chunk layouts (-l select=1,2,3 and -l place=scatter across a couple of nodes). Maybe start there, add a little complexity, and then start writing files directly.

Edit: this requires that mpiprocs be specified as part of the select.
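For example, something like this as a rough sketch (the chunk layout, the hello_mpi.py filename, and the mpiexec launcher are placeholders for whatever your MPI stack provides; some MPI builds also want -machinefile $PBS_NODEFILE passed explicitly):

#!/bin/bash
#PBS -N mpi_hello
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -l place=scatter
#PBS -j oe
cd $PBS_O_WORKDIR
# one rank per mpiprocs slot, spread across the two chunks
mpiexec -n 8 python hello_mpi.py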

Otherwise, I am wondering if it is your code hanging up or something.