1. Works fine.
2. Stage-in and execution work fine, but stage-out doesn't work.
In case 2, the MoM log shows:
12/18/2024 14:26:55;0001;pbs_mom;Fil;copy_file;Job 26.ip-0A582E84: sys_copy failed, return value=1
12/18/2024 14:26:55;0004;pbs_mom;Fil;26.ip-0A582E84.OU;Unable to copy file 26.ip-0A582E84.OU to ip-0a582e84://FeaJob.py.o26
12/18/2024 14:26:55;0004;pbs_mom;Fil;26.ip-0A582E84.OU;ip-0A582E84: Connection refused
12/18/2024 14:26:55;0001;pbs_mom;Fil;stage_file;Job 26.ip-0A582E84: no wildcards:remote stageout failed for saf112092 from 26.ip-0A582E84.OU to ip-0a582e84://FeaJob.py.o26
12/18/2024 14:26:55;0100;pbs_mom;Job;26.ip-0A582E84;Job files not copied:---->>>>
12/18/2024 14:26:55;0100;pbs_mom;Job;26.ip-0A582E84;Unable to copy file 26.ip-0A582E84.OU to ip-0a582e84://FeaJob.py.o26
12/18/2024 14:26:55;0100;pbs_mom;Job;26.ip-0A582E84;>>> error from copy
Please increase the MoM log level and check the detailed MoM log.
From the above logs, it looks like a firewall, blocked ports, or some other network issue.
Staging out stdout/stderr to the location where qsub was invoked seems to be failing.
Also, sharing a snippet of your stagein and stageout attributes might help.
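One quick way to test the firewall/port theory is to try a plain TCP connection from the execution host to the submission host. The sketch below is a minimal check, not part of PBS. The function name `can_connect` is made up for illustration, and the port you pass is an assumption: 514 is the traditional rsh/rcp `shell` port and 22 is the usual SSH port, so use whichever one your site's remote copy relies on.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and name-resolution failures.
        return False
```

Running something like `can_connect("ip-0a582e84", 514)` on the execution host and getting `False` would be consistent with the "Connection refused" line in the MoM log, although it does not say which side is blocking the connection.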
I reran the 2 cases:
1- [saf112092@ip-0A582E84 test14]$ qsub FeaJob.py (job 27)
2- [saf112092@ip-0A582E84 /]$ qsub /mnt/data/Public/pbs2/test14/FeaJob.py (job 28)
In case 2, it looks like it is trying to copy with pbs_rcp:
/opt/pbs/sbin/pbs_rcp -rp 28.ip-0A582E84.OU
whereas it is supposed to use pbs_cp.
In case 1, pbs_cp works fine:
12/19/2024 12:00:05;0100;pbs_mom;Req;;Type 54 request received from root@10.88.46.132:15001, sock=0
12/19/2024 12:00:05;0080;pbs_mom;Job;27.ip-0A582E84;copy file request received
12/19/2024 12:00:05;0008;pbs_mom;Job;27.ip-0A582E84;created the job directory /home/saf112092/pbs.27.ip-0A582E84.x8z
12/19/2024 12:00:06;0100;pbs_mom;Job;27.ip-0A582E84;Staged 1/1 items in over 0:00:01
12/19/2024 12:00:07;0100;pbs_mom;Req;;Type 1 request received from root@10.88.46.132:15001, sock=0
12/19/2024 12:00:07;0100;pbs_mom;Req;;Type 3 request received from root@10.88.46.132:15001, sock=0
12/19/2024 12:00:07;0100;pbs_mom;Req;;Type 5 request received from root@10.88.46.132:15001, sock=0
12/19/2024 12:00:07;0008;pbs_mom;Job;27.ip-0A582E84;created the job directory /home/saf112092/pbs.27.ip-0A582E84.x8z
12/19/2024 12:00:07;0008;pbs_mom;Job;27.ip-0A582E84;Started, pid = 12666
12/19/2024 12:05:38;0080;pbs_mom;Job;27.ip-0A582E84;task 00000001 terminated
12/19/2024 12:05:38;0008;pbs_mom;Job;27.ip-0A582E84;Terminated
12/19/2024 12:05:38;0100;pbs_mom;Job;27.ip-0A582E84;task 00000001 cput=00:00:03
12/19/2024 12:05:38;0008;pbs_mom;Job;27.ip-0A582E84;kill_job
12/19/2024 12:05:38;0100;pbs_mom;Job;27.ip-0A582E84;ip-0A582E86 cput=00:00:03 mem=247480kb
12/19/2024 12:05:38;0100;pbs_mom;Job;27.ip-0A582E84;Obit sent
12/19/2024 12:05:39;0100;pbs_mom;Req;;Type 54 request received from root@10.88.46.132:15001, sock=0
12/19/2024 12:05:39;0080;pbs_mom;Job;27.ip-0A582E84;copy file request received
12/19/2024 12:05:39;0100;pbs_mom;Job;27.ip-0A582E84;Staged 3/3 items out over 0:00:00
12/19/2024 12:05:39;0008;pbs_mom;Job;27.ip-0A582E84;no active tasks
12/19/2024 12:05:39;0100;pbs_mom;Req;;Type 55 request received from root@10.88.46.132:15001, sock=0
12/19/2024 12:05:39;0080;pbs_mom;Job;27.ip-0A582E84;delete file request received
12/19/2024 12:05:39;0008;pbs_mom;Job;27.ip-0A582E84;no active tasks
12/19/2024 12:05:39;0100;pbs_mom;Req;;Type 6 request received from root@10.88.46.132:15001, sock=0
12/19/2024 12:05:39;0080;pbs_mom;Job;27.ip-0A582E84;delete job request received
12/19/2024 12:05:39;0008;pbs_mom;Job;27.ip-0A582E84;kill_job
12/19/2024 12:08:07;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 10.88.46.132:15001 on stream 0
12/19/2024 12:08:07;0002;pbs_mom;Svr;im_eof;Server closed connection.
12/19/2024 12:08:07;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at ip-0A582E84:15001, stream:1
12/19/2024 12:08:07;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 10.88.46.132:15001 on stream 1
12/19/2024 12:08:07;0002;pbs_mom;Svr;im_eof;Server closed connection.
12/19/2024 12:08:08;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm ip-0A582E84:17001 down
12/19/2024 12:08:08;0001;pbs_mom;Svr;net_down_handler;net down handler called
12/19/2024 12:08:10;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 10.88.46.134:15003 to pbs_comm ip-0A582E84:17001
12/19/2024 12:08:10;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm ip-0A582E84:17001
12/19/2024 12:08:10;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
12/19/2024 12:08:14;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at ip-0A582E84:15001, stream:2
12/19/2024 12:08:14;0002;pbs_mom;Svr;pbs_mom;ReplyHello from server at 10.88.46.132:15001
12/19/2024 12:09:18;0100;pbs_mom;Req;;Type 54 request received from root@10.88.46.132:15001, sock=2
12/19/2024 12:09:18;0080;pbs_mom;Job;28.ip-0A582E84;copy file request received
12/19/2024 12:09:18;0008;pbs_mom;Job;28.ip-0A582E84;created the job directory /home/saf112092/pbs.28.ip-0A582E84.x8z
12/19/2024 12:09:18;0100;pbs_mom;Job;28.ip-0A582E84;Staged 1/1 items in over 0:00:00
12/19/2024 12:09:19;0100;pbs_mom;Req;;Type 1 request received from root@10.88.46.132:15001, sock=2
12/19/2024 12:09:19;0100;pbs_mom;Req;;Type 3 request received from root@10.88.46.132:15001, sock=2
12/19/2024 12:09:19;0100;pbs_mom;Req;;Type 5 request received from root@10.88.46.132:15001, sock=2
12/19/2024 12:09:19;0008;pbs_mom;Job;28.ip-0A582E84;created the job directory /home/saf112092/pbs.28.ip-0A582E84.x8z
12/19/2024 12:09:19;0008;pbs_mom;Job;28.ip-0A582E84;Started, pid = 12984
12/19/2024 12:12:32;0080;pbs_mom;Job;28.ip-0A582E84;task 00000001 terminated
12/19/2024 12:12:32;0008;pbs_mom;Job;28.ip-0A582E84;Terminated
12/19/2024 12:12:32;0100;pbs_mom;Job;28.ip-0A582E84;task 00000001 cput=00:00:01
12/19/2024 12:12:32;0008;pbs_mom;Job;28.ip-0A582E84;kill_job
12/19/2024 12:12:32;0100;pbs_mom;Job;28.ip-0A582E84;ip-0A582E86 cput=00:00:01 mem=81772kb
12/19/2024 12:12:32;0100;pbs_mom;Job;28.ip-0A582E84;Obit sent
12/19/2024 12:12:33;0100;pbs_mom;Req;;Type 54 request received from root@10.88.46.132:15001, sock=2
12/19/2024 12:12:33;0080;pbs_mom;Job;28.ip-0A582E84;copy file request received
12/19/2024 12:13:04;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp 28.ip-0A582E84.OU saf112092@ip-0a582e84://FeaJob.py.o28 status=1, try=1
12/19/2024 12:13:35;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp 28.ip-0A582E84.OU saf112092@ip-0a582e84://FeaJob.py.o28 status=1, try=2
12/19/2024 12:14:17;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp 28.ip-0A582E84.OU saf112092@ip-0a582e84://FeaJob.py.o28 status=1, try=3
12/19/2024 12:14:48;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp 28.ip-0A582E84.OU saf112092@ip-0a582e84://FeaJob.py.o28 status=1, try=4
12/19/2024 12:15:09;0001;pbs_mom;Fil;copy_file;Job 28.ip-0A582E84: sys_copy failed, return value=1
12/19/2024 12:15:09;0004;pbs_mom;Fil;28.ip-0A582E84.OU;Unable to copy file 28.ip-0A582E84.OU to ip-0a582e84://FeaJob.py.o28
12/19/2024 12:15:09;0004;pbs_mom;Fil;28.ip-0A582E84.OU;ip-0A582E84: Connection refused
12/19/2024 12:15:09;0001;pbs_mom;Fil;stage_file;Job 28.ip-0A582E84: no wildcards:remote stageout failed for saf112092 from 28.ip-0A582E84.OU to ip-0a582e84://FeaJob.py.o28
12/19/2024 12:15:09;0100;pbs_mom;Job;28.ip-0A582E84;Job files not copied:---->>>>
12/19/2024 12:15:09;0100;pbs_mom;Job;28.ip-0A582E84;Unable to copy file 28.ip-0A582E84.OU to ip-0a582e84://FeaJob.py.o28
12/19/2024 12:15:09;0100;pbs_mom;Job;28.ip-0A582E84;>>> error from copy
12/19/2024 12:15:09;0100;pbs_mom;Job;28.ip-0A582E84;>>> end error output
12/19/2024 12:15:09;0100;pbs_mom;Job;28.ip-0A582E84;---->>>>
12/19/2024 12:15:09;0100;pbs_mom;Job;28.ip-0A582E84;Staged 0/3 items out over 0:02:36
12/19/2024 12:15:09;0008;pbs_mom;Job;28.ip-0A582E84;no active tasks
12/19/2024 12:15:09;0080;pbs_mom;Req;req_reject;Reject reply code=15051, aux=0, type=54, from root@10.88.46.132:15001
12/19/2024 12:15:09;0100;pbs_mom;Req;;Type 55 request received from root@10.88.46.132:15001, sock=2
12/19/2024 12:15:09;0080;pbs_mom;Job;28.ip-0A582E84;delete file request received
12/19/2024 12:15:09;0008;pbs_mom;Job;28.ip-0A582E84;no active tasks
12/19/2024 12:15:09;0100;pbs_mom;Req;;Type 6 request received from root@10.88.46.132:15001, sock=2
12/19/2024 12:15:09;0080;pbs_mom;Job;28.ip-0A582E84;delete job request received
12/19/2024 12:15:09;0008;pbs_mom;Job;28.ip-0A582E84;kill_job
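The MoM log records above are semicolon-separated, so the failed copy attempts can be pulled out mechanically instead of read by eye. This is a rough sketch based only on the lines pasted here; the field names (`timestamp`, `event`, `daemon`, `kind`, `obj`, `message`) and the helper names are my own labels, not official PBS terminology.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class MomRecord(NamedTuple):
    timestamp: str
    event: str
    daemon: str
    kind: str
    obj: str
    message: str


def parse_mom_line(line: str) -> Optional[MomRecord]:
    # Split on the first five semicolons only; the message may contain more.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return MomRecord(*parts)


def failed_copies(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (timestamp, message) for sys_copy records with a non-zero status."""
    for line in lines:
        rec = parse_mom_line(line)
        if rec is None or rec.obj != "sys_copy":
            continue
        m = re.search(r"status=(\d+)", rec.message)
        if m and int(m.group(1)) != 0:
            yield rec.timestamp, rec.message
```

Fed the job 28 section of the log, this yields the four `pbs_rcp` attempts (try=1 to try=4), each with status=1.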
12/19/2024 12:09:18 L Considering job to run
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_ct_limit_max: entered for workq
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_ct_limit_max: exiting, ret 0 [max_queued limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_ct_limit_queued: entered for workq
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_ct_limit_queued: exiting, ret 0 [queued_jobs_threshold limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_ct_limit_max: entered for server
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_ct_limit_max: exiting, ret 0 [max_queued limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_ct_limit_queued: entered for server
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_ct_limit_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_resc_limit_max: entered for workq, alt_res (nil)
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_resc_limit_max: exiting, ret 0 [max_queued_res limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_resc_limit_queued: entered for workq, alt_res (nil)
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_resc_limit_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_resc_limit_max: entered for server, alt_res (nil)
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_resc_limit_max: exiting, ret 0 [max_queued_res limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_resc_limit_queued: entered for server, alt_res (nil)
12/19/2024 12:09:18 S ET_LIM_DBG: check_entity_resc_limit_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: account_entity_limit_usages: entered, INCR on server ip-0A582E84, op_flag f, alt_res_ptr (nil)
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
12/19/2024 12:09:18 S ET_LIM_DBG: account_entity_limit_usages: entered, INCR on queue workq, op_flag f, alt_res_ptr (nil)
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
12/19/2024 12:09:18 S Job Queued at request of saf112092@ip-0a582e84, owner = saf112092@ip-0a582e84, job name = FeaJob.py, queue = workq
12/19/2024 12:09:18 S Job Run at request of Scheduler@ip-0a582e84 on exec_vnode (ip-0A582E86:ncpus=1:mem=2097152kb)
12/19/2024 12:09:18 S ET_LIM_DBG: account_entity_limit_usages: entered, DECR on server ip-0A582E84, op_flag 7, alt_res_ptr (nil)
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
12/19/2024 12:09:18 S ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
12/19/2024 12:09:18 S ET_LIM_DBG: account_entity_limit_usages: entered, DECR on queue workq, op_flag 7, alt_res_ptr (nil)
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
12/19/2024 12:09:18 S ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for workq]
12/19/2024 12:09:18 S ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
12/19/2024 12:09:18 L Job run
12/19/2024 12:09:18 S Updated job state to 81 and substate to 11
12/19/2024 12:09:18 S enqueuing into workq, state Q hop 1
12/19/2024 12:09:18 S Updated job state to 82 and substate to 15
12/19/2024 12:09:19 S Updated job state to 82 and substate to 41
12/19/2024 12:09:21 S Received session ID for job: 12984
12/19/2024 12:09:21 S Updated job state to 82 and substate to 42
12/19/2024 12:09:31 S Received the same SID as before: 12984
12/19/2024 12:09:47 S Received the same SID as before: 12984
12/19/2024 12:10:10 S Received the same SID as before: 12984
12/19/2024 12:10:38 S Received the same SID as before: 12984
12/19/2024 12:11:12 S Received the same SID as before: 12984
12/19/2024 12:11:52 S Received the same SID as before: 12984
12/19/2024 12:12:33 S Obit received momhop:1 serverhop:1 state:R substate:42
12/19/2024 12:12:33 S Updated job state to 69 and substate to 50
12/19/2024 12:12:33 S Updated job state to 69 and substate to 51
12/19/2024 12:15:09 S Post job file processing error
12/19/2024 12:15:09 S Updated job state to 69 and substate to 52
12/19/2024 12:15:09 S Updated job state to 69 and substate to 53
12/19/2024 12:15:09 S Exit_status=0 resources_used.cpupercent=2 resources_used.cput=00:00:01 resources_used.mem=81772kb resources_used.ncpus=1 resources_used.vmem=272868kb resources_used.walltime=00:03:13
12/19/2024 12:15:09 S ET_LIM_DBG: account_entity_limit_usages: entered, DECR on server ip-0A582E84, op_flag b, alt_res_ptr (nil)
12/19/2024 12:15:09 S ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for server]
12/19/2024 12:15:09 S ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
12/19/2024 12:15:09 S ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
12/19/2024 12:15:09 S ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
12/19/2024 12:15:09 S ET_LIM_DBG: account_entity_limit_usages: entered, DECR on queue workq, op_flag b, alt_res_ptr (nil)
12/19/2024 12:15:09 S ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for workq]
12/19/2024 12:15:09 S ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
12/19/2024 12:15:09 S ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for workq]
12/19/2024 12:15:09 S ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
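The numeric job states in the server log above look like the ASCII codes of the one-letter job states: state 81 is logged next to "state Q", and the obit for state 82 reports "state:R". That is an inference from this log, not something confirmed here against the PBS sources. Under that assumption, a small hypothetical helper makes the state transitions easier to follow:

```python
import re
from typing import Optional, Tuple

STATE_RE = re.compile(r"Updated job state to (\d+) and substate to (\d+)")


def decode_state(line: str) -> Optional[Tuple[str, int]]:
    """Return (state letter, substate) for a state-update line, else None."""
    m = STATE_RE.search(line)
    if m is None:
        return None
    # Assumption: the state number is the ASCII code of the state letter.
    return chr(int(m.group(1))), int(m.group(2))
```

Read this way, job 28 goes Q (81) to R (82) to E (69), and the "Post job file processing error" is logged while the job is in state E, between substates 51 and 52.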
Thanks, I did it.
It is not working either:
12/19/2024 17:33:57;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp 33.ip-0A582E84.OU saf112092@ip-0a582e84:/opt/job_serv/FeaJob.py.o33 status=1, try=1
12/19/2024 17:34:28;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp 33.ip-0A582E84.OU saf112092@ip-0a582e84:/opt/job_serv/FeaJob.py.o33 status=1, try=2
12/19/2024 17:35:10;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp 33.ip-0A582E84.OU saf112092@ip-0a582e84:/opt/job_serv/FeaJob.py.o33 status=1, try=3
12/19/2024 17:35:41;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp 33.ip-0A582E84.OU saf112092@ip-0a582e84:/opt/job_serv/FeaJob.py.o33 status=1, try=4
It looks like the stageout is pointing to the wrong folder. It points to the folder from which I submitted the job:
[saf112092@ip-0A582E84 job_serv]$ qsub /mnt/data/Public/pbs2/test14/FeaJob.py
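The destinations in the logs are consistent with the stdout target being built from the directory where qsub was run, not from the directory containing the script. The function below only illustrates that pattern; it is not PBS code, and `default_stdout_target` and its parameters are hypothetical names.

```python
def default_stdout_target(host: str, submit_dir: str, job_name: str, seq: int) -> str:
    """Mimic the stdout destination pattern seen in the MoM log."""
    # Plain concatenation, with no path normalization, reproduces the
    # doubled slash seen in the log when the submission directory is "/".
    return f"{host}:{submit_dir}/{job_name}.o{seq}"


# Job 28 was submitted from "/", job 33 from /opt/job_serv.
print(default_stdout_target("ip-0a582e84", "/", "FeaJob.py", 28))
print(default_stdout_target("ip-0a582e84", "/opt/job_serv", "FeaJob.py", 33))
```

The two printed values match the `pbs_rcp` destinations for jobs 28 and 33, so the target folder follows the submission directory in both runs. If the output should go somewhere else, the `-o` and `-e` options of qsub accept an explicit path.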