Hi,
After I restarted one node, it started to behave strangely. Other nodes in the cluster run the jobs without trouble, but this specific node gives an error. Restarting the node and PBS again did not solve the problem. The output of the error file and the log file is below. Any help is appreciated.
//// Output of error file ////
Exception: [‘.//home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pbs_calistirma/6239[32].hep-node0/madevent/bin/internal/restore_data’, ‘default’] fails with no such file or directory
mv: cannot stat ‘./Events/GridRun_32/unweighted_events.lhe.gz’: No such file or directory
cp: cannot stat ‘/home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pbs_calistirma/6239[32].hep-node0/hep-node7_events_32.lhe.gz’: No such file or directory
//// Output of log file ////
07/20/2022 18:45:47;0008;pbs_mom;Job;6239[30].hep-node0;Started, pid = 3811
07/20/2022 18:45:47;0008;pbs_mom;Job;6239[31].hep-node0;Started, pid = 3812
07/20/2022 18:45:47;0008;pbs_mom;Job;6239[32].hep-node0;Started, pid = 3815
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[29].hep-node0;task 00000001 terminated
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[29].hep-node0;Terminated
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[29].hep-node0;task 00000001 cput=00:00:15
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[29].hep-node0;kill_job
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[29].hep-node0;hep-node7 cput=00:00:15 mem=7248kb
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[29].hep-node0;Obit sent
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[29].hep-node0;Job exited, Server acknowledged Obit
07/20/2022 19:08:12;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[29].hep-node0;copy file request received
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[29].hep-node0;no active tasks
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[31].hep-node0;task 00000001 terminated
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[31].hep-node0;Terminated
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[31].hep-node0;task 00000001 cput=00:00:15
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[31].hep-node0;kill_job
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[31].hep-node0;hep-node7 cput=00:00:15 mem=7260kb
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[31].hep-node0;Obit sent
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[29].hep-node0;no active tasks
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[30].hep-node0;task 00000001 terminated
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[30].hep-node0;Terminated
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[30].hep-node0;task 00000001 cput=00:00:15
07/20/2022 19:08:12;0008;pbs_mom;Job;6239[30].hep-node0;kill_job
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[30].hep-node0;hep-node7 cput=00:00:15 mem=7248kb
07/20/2022 19:08:12;0100;pbs_mom;Job;6239[30].hep-node0;Obit sent
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[31].hep-node0;Job exited, Server acknowledged Obit
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[30].hep-node0;Job exited, Server acknowledged Obit
07/20/2022 19:08:12;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[31].hep-node0;copy file request received
07/20/2022 19:08:12;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:12;0080;pbs_mom;Job;6239[30].hep-node0;copy file request received
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[29].hep-node0;no active tasks
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:13;0080;pbs_mom;Job;6239[32].hep-node0;task 00000001 terminated
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[32].hep-node0;Terminated
07/20/2022 19:08:13;0100;pbs_mom;Job;6239[32].hep-node0;task 00000001 cput=00:00:15
07/20/2022 19:08:13;0008;pbs_mom;Job;6239[32].hep-node0;kill_job
07/20/2022 19:08:13;0100;pbs_mom;Job;6239[32].hep-node0;hep-node7 cput=00:00:15 mem=7252kb
07/20/2022 19:08:13;0100;pbs_mom;Job;6239[32].hep-node0;Obit sent
07/20/2022 19:08:13;0080;pbs_mom;Job;6239[32].hep-node0;Job exited, Server acknowledged Obit
07/20/2022 19:08:13;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:13;0080;pbs_mom;Job;6239[32].hep-node0;copy file request received
07/20/2022 19:08:14;0100;pbs_mom;Job;6239[29].hep-node0;staged 2 items out over 0:00:02
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[29].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:14;0080;pbs_mom;Job;6239[29].hep-node0;delete job request received
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[29].hep-node0;kill_job
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0100;pbs_mom;Job;6239[30].hep-node0;staged 2 items out over 0:00:02
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0100;pbs_mom;Job;6239[32].hep-node0;staged 2 items out over 0:00:01
07/20/2022 19:08:14;0100;pbs_mom;Job;6239[31].hep-node0;staged 2 items out over 0:00:02
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;no active tasks
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;no active tasks
07/20/2022 19:08:14;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:14;0080;pbs_mom;Job;6239[30].hep-node0;delete job request received
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[30].hep-node0;kill_job
07/20/2022 19:08:14;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:14;0080;pbs_mom;Job;6239[31].hep-node0;delete job request received
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[31].hep-node0;kill_job
07/20/2022 19:08:14;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/20/2022 19:08:14;0080;pbs_mom;Job;6239[32].hep-node0;delete job request received
07/20/2022 19:08:14;0008;pbs_mom;Job;6239[32].hep-node0;kill_job
07/20/2022 19:18:17;0002;pbs_mom;Svr;pbs_mom;caught signal 15
07/20/2022 19:18:17;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Shutting down TPP transport Layer
07/20/2022 19:18:17;0d80;pbs_mom;TPP;pbs_mom(Thread 0);Thrd exiting, had 1 connections
07/20/2022 19:18:17;0002;pbs_mom;Svr;pbs_mom;Is down
07/20/2022 19:18:17;0002;pbs_mom;Svr;Log;Log closed
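When several array subjobs finish in the same second, their mom log lines interleave, as in the excerpt above. Below is a minimal sketch for pulling out the lines of a single subjob. It assumes the semicolon-separated layout visible in the excerpt (timestamp;event code;daemon;type;object;message); the function name `mom_job_messages` is hypothetical and not part of PBS.

```shell
#!/bin/sh
# Print "timestamp message" for every pbs_mom log line whose object field
# (5th ';'-separated column) equals the given job ID exactly.
# Assumption: the message itself contains no ';' characters.
mom_job_messages() {
    job_id=$1      # e.g. '6239[32].hep-node0' (quote it: brackets are glob characters)
    log_file=$2    # path to a mom log file; the location is site-specific
    awk -F';' -v id="$job_id" '$5 == id { print $1 " " $6 }' "$log_file"
}
```

For example, `mom_job_messages '6239[32].hep-node0' /var/spool/pbs/mom_logs/20220720` would list only that subjob's events. The log path here is an assumption based on the /var/spool/pbs prefix that appears elsewhere in this thread, so adjust it to your installation.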
Thanks for the reply, @adarsh. If scp were the problem, wouldn’t I have the same problem with the other nodes as well? I am having the issue only with this node.
Please submit the above script after setting the correct hostname of that node (:host=), so that it runs on that specific node, and see whether it runs or not.
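As a rough illustration of pinning a job to one host, the sketch below builds a one-chunk select statement with a `host=` restriction. The helper name `pin_to_host` is hypothetical, `ncpus=1` is an arbitrary choice, and `hep-node7` is only a guess at the affected node based on the hostname that appears in the logs.

```shell
#!/bin/sh
# Build a resource request that restricts a single-chunk job to one host.
# The ncpus=1 value is an assumption; use whatever the real job needs.
pin_to_host() {
    host=$1
    printf 'select=1:ncpus=1:host=%s\n' "$host"
}

# Possible usage on a submission host (not executed here):
#   qsub -l "$(pin_to_host hep-node7)" -- /bin/hostname
```

Submitting something as trivial as `/bin/hostname` this way separates a node-level problem from a problem inside the real job script.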
I did as you directed, but I still have the problem. Below is the output of the log file.
Exception: [‘.//home/ali_0/Madgraph/MG5_aMC_v2_6_7/bin/slepton14_TeV/pbs_calistirma/6259[400].hep-node0/madevent/bin/internal/restore_data’, ‘default’] fails with no such file or directory
mv: cannot stat ‘./Events/GridRun_400/unweighted_events.lhe.gz’: No such file or directory
This is the output of mom_logs.
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[402].hep-node0;task 00000001 terminated
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[401].hep-node0;task 00000001 terminated
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[401].hep-node0;Terminated
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[401].hep-node0;task 00000001 cput=00:00:14
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[401].hep-node0;kill_job
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[401].hep-node0;hep-node7 cput=00:00:14 mem=4644kb
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[401].hep-node0;Obit sent
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[402].hep-node0;Terminated
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[402].hep-node0;task 00000001 cput=00:00:13
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[402].hep-node0;kill_job
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[402].hep-node0;hep-node7 cput=00:00:13 mem=4636kb
07/23/2022 08:40:16;0100;pbs_mom;Job;6259[402].hep-node0;Obit sent
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:16;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[400].hep-node0;Job exited, Server acknowledged Obit
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[401].hep-node0;Job exited, Server acknowledged Obit
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[402].hep-node0;Job exited, Server acknowledged Obit
07/23/2022 08:40:16;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[400].hep-node0;copy file request received
07/23/2022 08:40:16;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:16;0080;pbs_mom;Job;6259[401].hep-node0;copy file request received
07/23/2022 08:40:17;0100;pbs_mom;Req;;Type 54 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:17;0080;pbs_mom;Job;6259[402].hep-node0;copy file request received
07/23/2022 08:40:17;0100;pbs_mom;Job;6259[400].hep-node0;staged 2 items out over 0:00:01
07/23/2022 08:40:17;0100;pbs_mom;Job;6259[402].hep-node0;staged 2 items out over 0:00:00
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:17;0100;pbs_mom;Job;6259[401].hep-node0;staged 2 items out over 0:00:01
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[400].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:17;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:18;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:18;0080;pbs_mom;Job;6259[400].hep-node0;delete job request received
07/23/2022 08:40:18;0008;pbs_mom;Job;6259[400].hep-node0;kill_job
07/23/2022 08:40:18;0008;pbs_mom;Job;6259[401].hep-node0;no active tasks
07/23/2022 08:40:18;0008;pbs_mom;Job;6259[402].hep-node0;no active tasks
07/23/2022 08:40:18;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:18;0080;pbs_mom;Job;6259[402].hep-node0;delete job request received
07/23/2022 08:40:18;0008;pbs_mom;Job;6259[402].hep-node0;kill_job
07/23/2022 08:40:18;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
07/23/2022 08:40:18;0080;pbs_mom;Job;6259[401].hep-node0;delete job request received
Thank you, @watzinki. Did the job run? Did you see any .o and .e files in the job submission directory?
You can enable job history and then submit the job: qmgr -c "set server job_history_enable=true"
Hello again, @adarsh. When I tried to submit a different job, I got the following error.
Do you think the problem is NFS-related or something else?
/var/spool/pbs/mom_priv/jobs/6291.hep-node0.SC: line 59: ./merge.pl: Permission denied
This looks like a permissions issue. Always check first that you can reach and execute that script on that compute node, and always use absolute paths in job scripts.
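The check described above can be sketched as a small shell function to run on the compute node as the user who owns the job. The name `check_script` is hypothetical, and the test only covers the basics (absolute path, file present, execute permission for the current user); it is not a full diagnosis of NFS or mount problems.

```shell
#!/bin/sh
# Return 0 only if the path is absolute, is a regular file, and is
# executable by the current user. Otherwise print the reason to stderr
# and return 1.
check_script() {
    path=$1
    case $path in
        /*) ;;   # absolute path, continue
        *)  echo "not an absolute path: $path" >&2; return 1 ;;
    esac
    [ -f "$path" ] || { echo "missing on this node: $path" >&2; return 1; }
    [ -x "$path" ] || { echo "not executable: $path" >&2; return 1; }
    return 0
}
```

Calling it near the top of a job script, for example `check_script /path/to/merge.pl || exit 1` (with a placeholder path), makes the job fail early with a clear message instead of a bare "Permission denied" partway through.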