@subhasisb and @billnitzberg, based on your points I ran a few tests. They are not at the scale of 1000+ MoMs or 1M jobs, but even at a much lower scale I could see the effect of sending all the resources to MoM.
The first test used the best combination I could achieve with the resources I had.
| No. of jobs | No. of resources per job | Data in each custom resource | Total data per job in resources_list | Time taken by send_job for a single job |
|---|---|---|---|---|
| 10000 | 1 | 500 KB | 500 KB | 0 seconds |
| 1000 | 5 | 500 KB | 2.5 MB | 0 seconds |
| 1000 | 10 | 500 KB | 5 MB | 0 seconds |
| 1000 | 50 | 500 KB | 25 MB | 0 seconds |
| 500 | 100 | 500 KB | 50 MB | 1 second |
| 100 | 250 | 500 KB | 75 MB | 2 seconds |
Once the total data per job went past 25 MB there was a delay: 1 second at 50 MB, and 2 seconds at 75 MB.
Next, I tested the usual scenario with 10 resources, each having 4 MB of data. I also introduced a MoM hook on the execjob_end event to see its effect on job execution. All jobs are single-node, 10-second sleep jobs. In all 3 cases I had 50 nodes running 50 jobs, and the server had the same load in terms of jobs.
Case 1: With 10 resources each having 4 MB of data:
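A minimal sketch of what an execjob_end hook like this could look like. The hook actually used in these tests is not shown in this post, so the body below is an assumption: it only logs the job id and accepts the event. The function name `run_hook` and the guarded import are hypothetical conveniences so the logic can be exercised outside pbs_python; `pbs.event()`, `pbs.logmsg()`, `pbs.LOG_DEBUG` and `accept()` are the standard hook calls.

```python
# Hypothetical execjob_end hook body (assumption: the real test hook is not shown).
try:
    import pbs  # only importable when run by pbs_python inside MoM
except ImportError:
    pbs = None  # allows the function below to be tried outside PBS


def run_hook(pbs_module):
    """Log the ending job's id and accept the event."""
    event = pbs_module.event()
    pbs_module.logmsg(
        pbs_module.LOG_DEBUG,
        "execjob_end hook ran for job %s" % event.job.id,
    )
    event.accept()


if pbs is not None:
    run_hook(pbs)
```

Even a hook this small makes MoM hand the job's attributes to pbs_python, so it is enough to expose the extra cost of large resource values.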
[root@blrentperf03 work]# grep '25.blrentperf03' /SSD_STORAGE/mom25/mom_logs/20190315
03/15/2019 00:17:35;0008;pbs_mom;Job;25.blrentperf03;Started, pid = 48928
03/15/2019 00:17:45;0080;pbs_mom;Job;25.blrentperf03;task 00000001 terminated
03/15/2019 00:17:45;0008;pbs_mom;Job;25.blrentperf03;Terminated
03/15/2019 00:17:45;0100;pbs_mom;Job;25.blrentperf03;task 00000001 cput= 0:00:00
03/15/2019 00:17:45;0008;pbs_mom;Job;25.blrentperf03;kill_job
03/15/2019 00:17:45;0100;pbs_mom;Job;25.blrentperf03;blrentperf03 cput= 0:00:00 mem=0kb
03/15/2019 00:17:45;0008;pbs_mom;Job;25.blrentperf03;no active tasks
03/15/2019 00:17:45;0100;pbs_mom;Job;25.blrentperf03;Obit sent
03/15/2019 00:17:51;0080;pbs_mom;Job;25.blrentperf03;delete job request received
03/15/2019 00:17:51;0008;pbs_mom;Job;25.blrentperf03;kill_job
Total time from job end to cleanup: 6 seconds.
Case 2: With 10 resources each having 4 MB of data and an execjob_end hook:
[root@blrentperf03 work]# grep '75.blrentperf03' /SSD_STORAGE/mom25/mom_logs/20190315
03/15/2019 00:34:22;0008;pbs_mom;Job;75.blrentperf03;Started, pid = 1700
03/15/2019 00:34:32;0080;pbs_mom;Job;75.blrentperf03;task 00000001 terminated
03/15/2019 00:34:32;0008;pbs_mom;Job;75.blrentperf03;Terminated
03/15/2019 00:34:32;0100;pbs_mom;Job;75.blrentperf03;task 00000001 cput= 0:00:00
03/15/2019 00:34:32;0008;pbs_mom;Job;75.blrentperf03;kill_job
03/15/2019 00:34:32;0100;pbs_mom;Job;75.blrentperf03;blrentperf03 cput= 0:00:00 mem=0kb
03/15/2019 00:34:32;0008;pbs_mom;Job;75.blrentperf03;no active tasks
03/15/2019 00:34:32;0100;pbs_mom;Job;75.blrentperf03;Obit sent
03/15/2019 00:34:38;0080;pbs_mom;Job;75.blrentperf03;delete job request received
03/15/2019 00:34:41;0008;pbs_mom;Job;75.blrentperf03;no active tasks
03/15/2019 00:34:42;0008;pbs_mom;Job;75.blrentperf03;kill_job
Total time from job end to cleanup: 10 seconds.
Case 3: With no resources and an execjob_end hook:
[root@blrentperf03 work]# grep '125.blrentperf03' /SSD_STORAGE/mom25/mom_logs/20190315
03/15/2019 00:38:51;0008;pbs_mom;Job;125.blrentperf03;Started, pid = 2363
03/15/2019 00:39:01;0080;pbs_mom;Job;125.blrentperf03;task 00000001 terminated
03/15/2019 00:39:01;0008;pbs_mom;Job;125.blrentperf03;Terminated
03/15/2019 00:39:01;0100;pbs_mom;Job;125.blrentperf03;task 00000001 cput= 0:00:00
03/15/2019 00:39:01;0008;pbs_mom;Job;125.blrentperf03;kill_job
03/15/2019 00:39:01;0100;pbs_mom;Job;125.blrentperf03;blrentperf03 cput= 0:00:00 mem=0kb
03/15/2019 00:39:01;0008;pbs_mom;Job;125.blrentperf03;no active tasks
03/15/2019 00:39:01;0100;pbs_mom;Job;125.blrentperf03;Obit sent
03/15/2019 00:39:01;0080;pbs_mom;Job;125.blrentperf03;delete job request received
03/15/2019 00:39:01;0008;pbs_mom;Job;125.blrentperf03;no active tasks
03/15/2019 00:39:02;0008;pbs_mom;Job;125.blrentperf03;kill_job
Total time from job end to cleanup: 1 second.
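The per-case totals can be reproduced from the MoM log lines themselves. The sketch below assumes the total is measured from the "Obit sent" entry to the last entry logged for the job; the helper names `parse_mom_log_line` and `cleanup_seconds` are made up for this illustration, and the sample lines are copied from the three logs in this post.

```python
from datetime import datetime


def parse_mom_log_line(line):
    """Split a pbs_mom log line into (timestamp, message)."""
    fields = line.strip().split(";", 5)
    stamp = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
    return stamp, fields[5]


def cleanup_seconds(lines):
    """Seconds from the 'Obit sent' entry to the last entry for the job."""
    entries = [parse_mom_log_line(l) for l in lines if l.strip()]
    obit = next(ts for ts, msg in entries if msg == "Obit sent")
    return int((entries[-1][0] - obit).total_seconds())


case1 = [
    "03/15/2019 00:17:45;0100;pbs_mom;Job;25.blrentperf03;Obit sent",
    "03/15/2019 00:17:51;0080;pbs_mom;Job;25.blrentperf03;delete job request received",
    "03/15/2019 00:17:51;0008;pbs_mom;Job;25.blrentperf03;kill_job",
]
case2 = [
    "03/15/2019 00:34:32;0100;pbs_mom;Job;75.blrentperf03;Obit sent",
    "03/15/2019 00:34:38;0080;pbs_mom;Job;75.blrentperf03;delete job request received",
    "03/15/2019 00:34:41;0008;pbs_mom;Job;75.blrentperf03;no active tasks",
    "03/15/2019 00:34:42;0008;pbs_mom;Job;75.blrentperf03;kill_job",
]
case3 = [
    "03/15/2019 00:39:01;0100;pbs_mom;Job;125.blrentperf03;Obit sent",
    "03/15/2019 00:39:01;0080;pbs_mom;Job;125.blrentperf03;delete job request received",
    "03/15/2019 00:39:01;0008;pbs_mom;Job;125.blrentperf03;no active tasks",
    "03/15/2019 00:39:02;0008;pbs_mom;Job;125.blrentperf03;kill_job",
]

for name, lines in [("Case 1", case1), ("Case 2", case2), ("Case 3", case3)]:
    print(name, cleanup_seconds(lines), "s")
```

In Cases 1 and 2 there is a 6-second gap between "Obit sent" and "delete job request received", while in Case 3, with no resources, the request arrives within the same second. The hook in Case 2 adds another 4 seconds after that request.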
I feel hooks are going to be affected more than send_job from the server to the mother superior. Since we dump all attributes to a file and pbs_python then reads that file to load the attributes again, the job's start-to-finish time will certainly see more delay as the number of resources sent to MoM increases.
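To give a rough feel for that dump-and-reload cost, here is a small standalone simulation. It is not the PBS implementation: the `name=value` file layout and the helper names `make_resources` and `round_trip` are assumptions made purely for illustration. It writes a set of resource values to a temporary file, reads them back, and times the round trip.

```python
import os
import tempfile
import time


def make_resources(count, size_bytes):
    """Build `count` dummy resources, each holding `size_bytes` characters."""
    return {"res%d" % i: "x" * size_bytes for i in range(count)}


def round_trip(resources):
    """Dump name=value lines to a temp file, reload them, return (parsed, seconds)."""
    start = time.perf_counter()
    fd, path = tempfile.mkstemp(suffix=".hookin")
    try:
        with os.fdopen(fd, "w") as out:
            for name, value in resources.items():
                out.write("%s=%s\n" % (name, value))
        parsed = {}
        with open(path) as src:
            for line in src:
                name, _, value = line.rstrip("\n").partition("=")
                parsed[name] = value
    finally:
        os.remove(path)
    return parsed, time.perf_counter() - start


if __name__ == "__main__":
    # No resources versus 10 resources of 4 MB each, as in the cases above.
    for count, size in [(0, 0), (10, 4 * 1024 * 1024)]:
        _, seconds = round_trip(make_resources(count, size))
        print("%d resources x %d bytes: %.3f s" % (count, size, seconds))
```

The absolute numbers will not match what MoM and pbs_python see, but the round-trip time grows with the amount of attribute data, and it is paid again for every hook event that fires for the job.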