Job completed but no output

Hi all,
After running the following job script, no output file or error file is created.

#!/bin/bash

### The job name

#PBS -N moose_test

### The number of nodes and processors per node

#PBS -l nodes=node03:ppn=56

### The maximum run time for the job

#PBS -l walltime=48:00:00

### The standard output and error

#PBS -j oe

### The queue for the job

#conda activate moose

#

cd $PBS_O_WORKDIR

NSLOTS=$(cat ${PBS_NODEFILE} | wc -l)

echo "This job is $PBS_JOBID@$PBS_QUEUE"

uniq -c $PBS_NODEFILE | awk '{print $2":"$1}'

#

date

time mpiexec -n 56 ./*opt -i ./test/tests/kernels/simple_diffusion/simple_diffusion.i

date

exit 0

When I run ssh node03, I am not asked for a password.

Does the job actually run? i.e. what is the output of qstat -a and qstat -xf, and also tracejob jobid?
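For example, replacing jobid with the id that qsub printed:

qstat -a
qstat -xf jobid
tracejob jobid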

Hi, here is the info from the commands you mentioned:

qstat -a

node01: 
                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
867.node01              DengChaoQun batch    moose_test        29802     1     56       --   48:00:00 C       --

qstat -xf

<?xml version="1.0"?>
<Data><Job><Job_Id>867.node01</Job_Id><Job_Name>moose_test</Job_Name><Job_Owner>DengChaoQun@node01</Job_Owner><resources_used><cput>00:02:32</cput><vmem>0kb</vmem><walltime>00:00:08</walltime><mem>0kb</mem><energy_used>0</energy_used></resources_used><job_state>C</job_state><queue>batch</queue><server>node01</server><Checkpoint>u</Checkpoint><ctime>1707221317</ctime><Error_Path>node01:/home/DengChaoQun/projects/bees2024/moose_test.e867</Error_Path><exec_host>node03/0-55</exec_host><Hold_Types>n</Hold_Types><Join_Path>oe</Join_Path><Keep_Files>n</Keep_Files><Mail_Points>a</Mail_Points><mtime>1707221329</mtime><Output_Path>node01:/home/DengChaoQun/projects/bees2024/moose_test.o867</Output_Path><Priority>0</Priority><qtime>1707221317</qtime><Rerunable>True</Rerunable><Resource_List><nodes>node03:ppn=56</nodes><walltime>48:00:00</walltime><nodect>1</nodect></Resource_List><session_id>29802</session_id><Variable_List>PBS_O_QUEUE=batch,PBS_O_HOME=/home/DengChaoQun,PBS_O_LOGNAME=DengChaoQun,PBS_O_PATH=/home/DengChaoQun/mpich-4.0.2/install/bin:/home/DengChaoQun/gcc-13.1.0/gcc-install/bin:/opt/torque/bin:/opt/torque/sbin:/home/DengChaoQun/.vscode-server/bin/8b3775030ed1a69b13e4f4c628c612102e30a681/bin/remote-cli:/home/DengChaoQun/mpich-4.0.2/install/bin:/home/DengChaoQun/gcc-13.1.0/gcc-install/bin:/opt/rh/devtoolset-9/root/usr/bin:/opt/torque/bin:/opt/torque/sbin/opt/intel/oneapi/vtune/2022.2.0/bin64:/opt/intel/oneapi/vpl/2022.1.0/bin:/opt/intel/oneapi/mpi/2021.6.0/libfabric/bin:/opt/intel/oneapi/mpi/2021.6.0/bin:/opt/intel/oneapi/mkl/2022.1.0/bin/intel64:/opt/intel/oneapi/itac/2021.6.0/bin:/opt/intel/oneapi/inspector/2022.1.0/bin64:/opt/intel/oneapi/dpcpp-ct/2022.1.0/bin:/opt/intel/oneapi/dev-utilities/2021.6.0/bin:/opt/intel/oneapi/debugger/2021.6.0/gdb/intel64/bin:/opt/intel/oneapi/compiler/2022.1.0/linux/lib/oclfpga/bin:/opt/intel/oneapi/compiler/2022.1.0/linux/bin/intel64:/opt/intel/oneapi/compiler/2022.1.0/linux/bin:/opt/intel/oneapi/clck/2021.6.0/bin/intel64:/opt/intel/oneapi/advisor/2022.1.0/bin64:/opt/torque/bin:/opt/torque/sbin:/home/DengChaoQun/mpich-4.0.2/install/bin:/home/DengChaoQun/gcc-13.1.0/gcc-install/bin:/home/DengChaoQun/miniforge/envs/moose/bin:/home/DengChaoQun/miniforge/condabin:/opt/torque/bin:/opt/torque/sbin:/usr/lib64/qt-3.3/bin:/home/DengChaoQun/perl5/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/DengChaoQun/.local/bin:/home/DengChaoQun/bin:/home/DengChaoQun/miniforge/envs/moose/wasp/bin,PBS_O_MAIL=/var/spool/mail/DengChaoQun,PBS_O_SHELL=/bin/bash,PBS_O_LANG=en_US.UTF-8,PBS_O_WORKDIR=/home/DengChaoQun/projects/bees2024,PBS_O_HOST=node01,PBS_O_SERVER=node01</Variable_List><euser>DengChaoQun</euser><egroup>DengChaoQun</egroup><queue_type>E</queue_type><sched_hint>Unable to copy files back - please see the mother superior&apos;s log for exact details.</sched_hint><comment>Job started on Tue Feb 06 at 20:08</comment><etime>1707221317</etime><exit_status>0</exit_status><submit_args>run_moose.pbs</submit_args><start_time>1707221317</start_time><start_count>1</start_count><fault_tolerant>False</fault_tolerant><comp_time>1707221329</comp_time><job_radix>0</job_radix><total_runtime>12.143389</total_runtime><submit_host>node01</submit_host><init_work_dir>/home/DengChaoQun/projects/bees2024</init_work_dir><request_version>1</request_version></Job></Data>

tracejob 867

/var/spool/torque/server_priv/accounting/20240206: Permission denied
/var/spool/torque/mom_logs/20240206: No matching job records located

Job: 867.node01

02/06/2024 20:08:37.607 S    enqueuing into batch, state 1 hop 1
02/06/2024 20:08:37.788 S    Job Modified at request of root@node01
02/06/2024 20:08:37.821 L    Job Run
02/06/2024 20:08:37.788 S    Job Run at request of root@node01
02/06/2024 20:08:37.821 S    Not sending email: User does not want mail of this type.
02/06/2024 20:08:49.931 S    Not sending email: User does not want mail of this type.
02/06/2024 20:08:49.932 S    Exit_status=0 resources_used.cput=152 resources_used.vmem=0kb resources_used.walltime=00:00:08 resources_used.mem=0kb resources_used.energy_used=0

I think this could be why you are not seeing the output files; the sched_hint in your qstat -xf output says: "Unable to copy files back - please see the mother superior's log for exact details."
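A quick way to pull that field out of the one-line XML, assuming GNU sed is available:

qstat -xf 867 | sed 's/></>\n</g' | grep sched_hint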

Are you able to view the mom_logs on node03? They may show the reason why it cannot copy the files back.
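For example, from the login node, assuming the same date-named log layout on node03 as in your tracejob output:

ssh node03 'grep 867.node01 /var/spool/torque/mom_logs/20240206'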

Here is some info from the mom_logs file:

02/06/2024 20:10:06.995;01;   pbs_mom.3769;Job;TMomFinalizeJob3;job 867.node01 started, pid = 29802
02/06/2024 20:10:14.878;128;   pbs_mom.3769;Job;867.node01;scan_for_terminated: job 867.node01 task 1 terminated, sid=29802
02/06/2024 20:10:14.878;08;   pbs_mom.3769;Job;867.node01;job was terminated
02/06/2024 20:10:14.878;128;   pbs_mom.3769;Svr;preobit_preparation;top
02/06/2024 20:10:19.102;128;   pbs_mom.3769;Job;867.node01;obit sent to server
02/06/2024 20:10:19.107;128;   pbs_mom.3811;Job;867.node01;removed job script
02/06/2024 20:11:32.391;02;   pbs_mom.3769;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
02/06/2024 20:16:32.864;02;   pbs_mom.3769;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
02/06/2024 20:21:37.863;02;   pbs_mom.3769;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
02/06/2024 20:26:42.933;02;   pbs_mom.3769;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0
02/06/2024 20:31:47.887;02;   pbs_mom.3769;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0

On node03, do you see anything in /var/spool/pbs/undelivered/?

On node03, can you touch a file in $PBS_O_WORKDIR?
On node03, is scp installed?
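A minimal sketch of all three checks, run from node03, using the work directory shown in your qstat -xf output:

ls -l /var/spool/pbs/undelivered/
touch /home/DengChaoQun/projects/bees2024/test_file.txt
which scp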

Thanks for your reply!

On node03, I see the output file 867.node01.OU in /var/spool/pbs/undelivered/, but no error file.

Since other people can get output files using the same script, I think scp is installed on node03.

  • When I run echo $PBS_O_WORKDIR on node03, it returns nothing.
  • And when I run touch $PBS_O_WORKDIR/test_file.txt, it shows:
    touch: cannot touch ‘/test_file.txt’: Permission denied

So maybe this is what causes the problem.

I am a beginner in PBS scripting. Can you help me solve this problem? Thanks a lot!

As you have specified -j oe, 867.node01.OU will contain standard error as well:

oe: Standard error and standard output are merged into standard output.

Is there anything useful in this file?

You will have to already know the location of $PBS_O_WORKDIR, i.e. the directory that you submitted from on the server. PBS only sets $PBS_O_WORKDIR inside the job environment, which is why echoing it in an interactive shell on node03 returns nothing (and why your touch above tried to create /test_file.txt).
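A minimal way to see this for yourself (workdir_test is just a hypothetical name):

#!/bin/bash
#PBS -N workdir_test
#PBS -l nodes=1:ppn=1
# PBS sets PBS_O_WORKDIR only inside the job environment;
# the same echo in an interactive shell prints an empty value.
echo "PBS_O_WORKDIR=$PBS_O_WORKDIR"

Submitted with qsub, the output file will show the directory you submitted from.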

It contains the output of my running case, but nothing useful for solving my problem.

Maybe I made a mistake here and caused a misunderstanding: on node03, when I run touch test_file.txt instead of touch $PBS_O_WORKDIR/test_file.txt, it creates a file named test_file.txt.

Is that in the same directory that you submitted from i.e. /home/username/projects/bees2024/ ?

Yes, the same directory, and I am on node03 now.

If you remove the line:

Does it then output properly?
What are others running?

I removed this line, but the new output file still ends up in the directory /var/spool/pbs/undelivered/.

Other people use the same script to run the same case, but their output files end up in the directory they submitted from.

You could try qsub -koed -o (output_path) -e (error_path) job.sh

which should write standard output and error directly to the final destination.
Any errors should be in /var/spool/torque/mom_logs/ on node03.
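For example, using the submit directory from your qstat -xf output (the .out/.err file names are just an illustration):

qsub -koed -o /home/DengChaoQun/projects/bees2024/moose_test.out -e /home/DengChaoQun/projects/bees2024/moose_test.err run_moose.pbs

Note that the d in -koed (write directly to the final destination) is a PBS Pro option; under Torque, -k only accepts combinations of o, e, and n, so it is worth confirming which scheduler the cluster runs.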

Can I run this command on the login node, or do I have to run it on node03?

On the login node, where you normally run qsub from.

  • I created a directory for_out in /home/username/projects/bees2024/ and ran qsub -koe -o ./for_out/out.OU -e ./for_out/error.ER run_moose.pbs, but there are still no output files.

  • The job id is 873, but this time I can't find any 873-related files in the directory /var/spool/pbs/undelivered/.

  • Here is the info in /var/spool/torque/mom_logs/:

02/07/2024 00:12:10.903;01;   pbs_mom.3769;Job;TMomFinalizeJob3;job 873.node01 started, pid = 40049
02/07/2024 00:12:18.662;128;   pbs_mom.3769;Job;873.node01;scan_for_terminated: job 873.node01 task 1 terminated, sid=40049
02/07/2024 00:12:18.662;08;   pbs_mom.3769;Job;873.node01;job was terminated
02/07/2024 00:12:18.662;128;   pbs_mom.3769;Svr;preobit_preparation;top
02/07/2024 00:12:27.052;128;   pbs_mom.3769;Job;873.node01;obit sent to server
02/07/2024 00:12:27.055;128;   pbs_mom.3825;Job;873.node01;removed job script
02/07/2024 00:14:35.506;02;   pbs_mom.3769;Svr;pbs_mom;Torque Mom Version = 6.1.1.1, loglevel = 0

Could you try using the full path rather than a relative one?
i.e. -o /home/DengChaoQun/projects/bees2024/out.OU -e /home/DengChaoQun/projects/bees2024/err.ER

I tried the full path, but the result is the same.

Could you please check whether you are using OpenPBS/PBS Pro or Torque?
Could you please share the output of the commands below:
qstat --version
qstat -Bf