I am submitting array jobs with 80 subjobs in them.
In the job script, each subjob reads one line from its assigned text file to get the parameters for the Matlab script, substitutes these values into the .m file with ‘sed’, and then runs it with Matlab.
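To make the setup concrete, here is a minimal sketch of the per-subjob step. It is illustrative rather than the exact script: the file names (params.txt, template.m, run_N.m), the PARAM_VALUE placeholder, the helper function names, and the use of PBS_ARRAY_INDEX as the subjob index are all assumptions. The sketch also writes a separate .m copy per subjob instead of editing one file in place, which is a choice made for the example.

```shell
#!/bin/bash
# Hypothetical sketch of the per-subjob step; all names are illustrative.

# Print line number $2 of the parameter file $1.
get_param_line() {
    sed -n "${2}p" "$1"
}

# Write a per-subjob copy of the template $1 to $3, replacing the
# placeholder PARAM_VALUE with $2. The template itself is not modified.
# Note: a value containing "/" would need a different sed delimiter.
make_script() {
    sed "s/PARAM_VALUE/${2}/g" "$1" > "$3"
}

# Example wiring inside the job script (assumes the scheduler exports
# PBS_ARRAY_INDEX for each subjob, as PBS Pro does):
# idx="${PBS_ARRAY_INDEX}"
# value="$(get_param_line params.txt "$idx")"
# make_script template.m "$value" "run_${idx}.m"
# matlab -nodisplay -nosplash -r "run('run_${idx}.m'); exit"
```

With this layout, subjob 2 would read the second line of params.txt and run its own run_2.m, so no two subjobs write to the same file.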
However, I am getting random failures: some subjobs fail almost instantly after they start, with no error report, and their status goes straight from Q to X. When I delete the whole job array and resubmit it, the subjobs that failed before may run smoothly, while others that previously ran smoothly fail instead.
Does anyone know what might be causing this?
I think it is unlikely that the parameters or the Matlab code caused the error: I tried some of the failed parameters, and they ran without error both on my laptop and on the cluster while the surviving subjobs were left to complete.
Thank you very much!