Job not producing an output file after running

Hi,
After restarting the node, it started to behave strangely. Although the other nodes in the cluster can run the jobs, this specific node gives an error. Restarting the node and PBS did not solve the problem. Below are the outputs of the error file and the log file. Any help is appreciated.
//////// Output of the error file ////////
Exception: ['.//home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pbs_calistirma/6239[32].hep-node0/madevent/bin/internal/restore_data', 'default'] fails with no such file or directory
mv: cannot stat './Events/GridRun_32/unweighted_events.lhe.gz': No such file or directory
cp: cannot stat '/home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pbs_calistirma/6239[32].hep-node0/hep-node7_events_32.lhe.gz': No such file or directory

//////// Output of the log file ////////

07/20/2022 18:45:47;0008;pbs_mom;Job;6239[30].hep-node0;Started, pid = 3811
07/20/2022 18:45:47;0008;pbs_mom;Job;6239[31].hep-node0;Started, pid = 3812
07/20/2022 18:45:47;0008;pbs_mom;Job;6239[32].hep-node0;Started, pid = 3815
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[29].hep-node0;task 00000001 terminated
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[29].hep-node0;Terminated
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[29].hep-node0;task 00000001 cput=00:00:15
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[29].hep-node0;kill_job
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[29].hep-node0;hep-node7 cput=00:00:15 mem=7248kb
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[29].hep-node0;Obit sent
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[29].hep-node0;Job exited, Server acknowledged Obit
07/20/2022 19:08:12;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[29].hep-node0;copy file request received
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[29].hep-node0;no active tasks
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[31].hep-node0;task 00000001 terminated
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[31].hep-node0;Terminated
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[31].hep-node0;task 00000001 cput=00:00:15
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[31].hep-node0;kill_job
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[31].hep-node0;hep-node7 cput=00:00:15 mem=7260kb
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[31].hep-node0;Obit sent
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[29].hep-node0;no active tasks
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[30].hep-node0;task 00000001 terminated
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[30].hep-node0;Terminated
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[30].hep-node0;task 00000001 cput=00:00:15
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[30].hep-node0;kill_job
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[30].hep-node0;hep-node7 cput=00:00:15 mem=7248kb
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[30].hep-node0;Obit sent
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[31].hep-node0;Job exited, Server acknowledged Obit
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[30].hep-node0;Job exited, Server acknowledged Obit
07/20/2022 19:08:12;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[31].hep-node0;copy file request received
07/20/2022 19:08:12;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[30].hep-node0;copy file request received
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[29].hep-node0;no active tasks
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:13;0080;pbs_mom;Job;6239[32].hep-node0;task 00000001 terminated
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[32].hep-node0;Terminated
07/20/2022 19:08:13;0100;pbs_mom;Job;6239[32].hep-node0;task 00000001 cput=00:00:15
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[32].hep-node0;kill_job
07/20/2022 19:08:13;0100;pbs_mom;Job;6239[32].hep-node0;hep-node7 cput=00:00:15 mem=7252kb
07/20/2022 19:08:13;0100;pbs_mom;Job;6239[32].hep-node0;Obit sent
07/20/2022 19:08:13;0080;pbs_mom;Job;6239[32].hep-node0;Job exited, Server acknowledged Obit
07/20/2022 19:08:13;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:13;0080;pbs_mom;Job;6239[32].hep-node0;copy file request received
07/20/2022 19:08:14;0100;pbs_mom;Job;6239[29].hep-node0;staged 2 items out over 0:00:02
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[29].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:14;0080;pbs_mom;Job;6239[29].hep-node0;delete job request received
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[29].hep-node0;kill_job
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0100;pbs_mom;Job;6239[30].hep-node0;staged 2 items out over 0:00:02
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0100;pbs_mom;Job;6239[32].hep-node0;staged 2 items out over 0:00:01
07/20/2022 19:08:14;0100;pbs_mom;Job;6239[31].hep-node0;staged 2 items out over 0:00:02
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:14;0080;pbs_mom;Job;6239[30].hep-node0;delete job request received
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;kill_job
07/20/2022 19:08:14;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:14;0080;pbs_mom;Job;6239[31].hep-node0;delete job request received
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;kill_job
07/20/2022 19:08:14;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:14;0080;pbs_mom;Job;6239[32].hep-node0;delete job request received
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;kill_job
07/20/2022 19:18:17;0002;pbs_mom;Svr;pbs_mom;caught signal 15
07/20/2022 19:18:17;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Shutting down TPP transport Layer
07/20/2022 19:18:17;0d80;pbs_mom;TPP;pbs_mom(Thread 0);Thrd exiting, had 1 connections
07/20/2022 19:18:17;0002;pbs_mom;Svr;pbs_mom;Is down
07/20/2022 19:18:17;0002;pbs_mom;Svr;Log;Log closed

Please share your job script contents here

#PBS -l select=1:ncpus=1:mem=800mb
#PBS -l walltime=400:0:0
#PBS -J 101-200:1
#PBS -S /bin/bash
#cd /home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/third_ttbar_pythia/PBS_calistirma ###$PBS_O_WORKDIR

NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo "Total CPU count = $NP"

#echo "Script begins here"
#cat
echo "Running on:"
node_name=$(cat ${PBS_NODEFILE})
echo "${node_name}"
echo "Program Output begins: "
cd $PBS_O_WORKDIR

mkdir $PBS_JOBID

chmod 700 $PBS_JOBID

cd $PBS_JOBID
#############################
gridpack_konumu="/home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pbs_calistirma"

scp ${gridpack_konumu}/run_01_gridpack.tar.gz .

#scp /home/ali_0/Madgraph/MG5_aMC_v2_9_3/bin/lhapdf_deneme/pbs_calistirma_lhapdfli/run_01_gridpack.tar.gz .

tar -xvf run_01_gridpack.tar.gz --warning=no-timestamp
chmod a+x run.sh

### Let's keep a copy of the submission script in the event directory
pwd=$(pwd)
target_dir="${pwd}/madevent/Events/."
scp $0 ${target_dir}

### Run gridrun explicitly under python2.7
sed -i 's+${DIR}/bin/gridrun+python2.7 ${DIR}/bin/gridrun+' run.sh

### Tag the output event file with the node name and array index
sed -i 's+${DIR}/Events/GridRun_${seed}/unweighted_events.lhe.gz events.lhe.gz+${DIR}/Events/GridRun_${seed}/unweighted_events.lhe.gz '"${node_name}"'events${PBS_ARRAY_INDEX}.lhe.gz+' run.sh

bash run.sh 100000 $PBS_ARRAY_INDEX

###lets copy the event file

scp ${gridpack_konumu}/$PBS_JOBID/${node_name}events${PBS_ARRAY_INDEX}.lhe.gz ${gridpack_konumu}/Events_10M_2/.

cd $PBS_O_WORKDIR
rm -rf $PBS_JOBID

Please check the scp syntax.
Ref: SCP Linux - Securely Copy Files Using SCP examples
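
For reference, the general form is something like the sketch below (the user, host, and paths are placeholders, not taken from your setup):

# Copy a local file to a remote host
scp /local/path/file.lhe.gz user@remote-host:/remote/path/

# Copy a remote file into the current directory
scp user@remote-host:/remote/path/file.lhe.gz .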

Thanks @adarsh for the reply. If scp were the problem, wouldn't I have the same problem on the other nodes as well? But I am having the issue only with this node.

#PBS -l select=1:ncpus=1:mem=800mb:host=computenode
#PBS -l walltime=400:0:0
#PBS -J 101-102:1
#PBS -S /bin/bash

NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
echo "Total CPU count = $NP"
env

Please submit/run the above script after updating :host= with the correct hostname of that node, so that it runs on that specific node, and see whether it runs or not.
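
For example, a minimal sketch, assuming the test script above is saved as test_env.sh (hep-node7 is only a guess at the problem node, based on your logs):

sed -i 's/:host=computenode/:host=hep-node7/' test_env.sh   # point the job at the suspect node
qsub test_env.sh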

I did as you directed. I am still having the problem. Below is the output of the error file.
Exception: ['.//home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pbs_calistirma/6259[400].hep-node0/madevent/bin/internal/restore_data', 'default'] fails with no such file or directory
mv: cannot stat './Events/GridRun_400/unweighted_events.lhe.gz': No such file or directory

This is the output of mom_logs.
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[402].hep-node0;task 00000001 terminated
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[401].hep-node0;task 00000001 terminated
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[401].hep-node0;Terminated
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[401].hep-node0;task 00000001 cput=00:00:14
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[401].hep-node0;kill_job
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[401].hep-node0;hep-node7 cput=00:00:14 mem=4644kb
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[401].hep-node0;Obit sent
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[402].hep-node0;Terminated
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[402].hep-node0;task 00000001 cput=00:00:13
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[402].hep-node0;kill_job
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[402].hep-node0;hep-node7 cput=00:00:13 mem=4636kb
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[402].hep-node0;Obit sent
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[400].hep-node0;Job exited, Server acknowledged Obit
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[401].hep-node0;Job exited, Server acknowledged Obit
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[402].hep-node0;Job exited, Server acknowledged Obit
07/23/2022 08:40:16;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[400].hep-node0;copy file request received
07/23/2022 08:40:16;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[401].hep-node0;copy file request received
07/23/2022 08:40:17;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:17;0080;pbs_mom;Job;6259[402].hep-node0;copy file request received
07/23/2022 08:40:17;0100;pbs_mom;Job;6259[400].hep-node0;staged 2 items out over 0:00:01
07/23/2022 08:40:17;0100;pbs_mom;Job;6259[402].hep-node0;staged 2 items out over 0:00:00
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:17;0100;pbs_mom;Job;6259[401].hep-node0;staged 2 items out over 0:00:01
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:18;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:18;0080;pbs_mom;Job;6259[400].hep-node0;delete job request received
07/23/2022 08:40:18;0008;pbs_mom;Job;6259[400].hep-node0;kill_job
07/23/2022 08:40:18;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:18;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:18;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:18;0080;pbs_mom;Job;6259[402].hep-node0;delete job request received
07/23/2022 08:40:18;0008;pbs_mom;Job;6259[402].hep-node0;kill_job
07/23/2022 08:40:18;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:18;0080;pbs_mom;Job;6259[401].hep-node0;delete job request received

To avoid confusion, please run this command line:
qsub -l select=1:ncpus=1:mem=800mb:host=computenodename -- /bin/hostname

Please share the qstat -fx output of this job ID after submission.
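
For example (hep-node7 is assumed here to be the suspect node; qsub prints the new job ID, which can be passed straight to qstat):

jobid=$(qsub -l select=1:ncpus=1:mem=800mb:host=hep-node7 -- /bin/hostname)
qstat -fx "$jobid"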

Output is as below.
qstat: PBS is not configured to maintain job history

Thank you @watzinki. Did the job run? Did you see any .o and .e files in the job submission directory?
You can enable the job history and then submit the job
qmgr -c "set server job_history_enable=true"

Output of qstat -fx is as below:
Job Id: 6273.hep-node0
Job_Name = STDIN
Job_Owner = ali_0@hep-node0
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.ncpus = 1
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = F
queue = batch
server = hep-node0
Checkpoint = u
ctime = Sun Jul 24 22:42:32 2022
Error_Path = hep-node0:/home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_Te
V/pbs_calistirma/STDIN.e6273
exec_host = hep-node7/0
exec_vnode = (hep-node7:ncpus=1:mem=819200kb)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Sun Jul 24 22:42:33 2022
Output_Path = hep-node0:/home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_T
eV/pbs_calistirma/STDIN.o6273
Priority = 0
qtime = Sun Jul 24 22:42:32 2022
Rerunable = True
Resource_List.mem = 800mb
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.place = free
Resource_List.select = 1:ncpus=1:mem=800mb:host=hep-node7
Resource_List.walltime = 100:00:00
stime = Sun Jul 24 22:42:32 2022
session_id = 7015
jobdir = /home/ali_0
substate = 92
Variable_List = PBS_O_HOME=/home/ali_0,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=ali_0,
PBS_O_PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/pbs/bin:/opt/pbs/sbin:/u
sr/bin:/opt/pbs/bin,PBS_O_MAIL=/var/spool/mail/ali_0,
PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pb
s_calistirma,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=batch,PBS_O_HOST=hep-node0
comment = Job run at Sun Jul 24 at 22:42 on (hep-node7:ncpus=1:mem=819200kb
) and finished
etime = Sun Jul 24 22:42:32 2022
run_count = 1
Stageout_status = 1
Exit_status = 0
Submit_arguments = -l select=1:ncpus=1:mem=800mb:host=hep-node7 -- /bin/hos
tname
executable = <jsdl-hpcpa:Executable>/bin/hostname</jsdl-hpcpa:Executable>
history_timestamp = 1658691753
project = _pbs_project_default
Submit_Host = hep-node0

.o and .e files are created. While the .e file is empty, the .o file contains only the hostname of the node.

This looks good to me; there are no issues with the compute node.

  • The .e file is empty, as there were no standard errors
  • The .o file has the hostname, which is the result of running the /bin/hostname command

All good.


Thank you @adarsh for your help so far!

Hello again, @adarsh. When I tried to submit a different job, I got the following error.
Do you think the problem is NFS-related or something else?
/var/spool/pbs/mom_priv/jobs/6291.hep-node0.SC: line 59: ./merge.pl: Permission denied

Seems like a permissions issue. Please always make sure you can reach/execute that script on the compute node first. Always use absolute paths in job scripts.
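
A quick check along those lines, as a sketch (hep-node7 and the merge.pl path are assumptions; adjust to your actual layout):

# Confirm merge.pl is visible and executable from the compute node
ssh hep-node7 'ls -l /home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pbs_calistirma/merge.pl'

# If the execute bit is missing, add it on the shared filesystem
chmod +x /home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pbs_calistirma/merge.pl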