When I submit an array of 1000 identical jobs (clones) that run some simple I/O data processing, many of the jobs are put into a “held” state due to failed initialization, while others run perfectly fine. The jobs are exact copies of one another: nothing in the script changes from job to job. I do not understand why some jobs fail to initialize while others run without problems. When I run the same script on the command line or on an SGE cluster, it runs properly. What is causing this issue, and how do I stop it from happening?
These “held” jobs also put a hold on the entire job array. How can I make sure that a single held job does not stop the rest of the array from continuing?