I’m encountering a problem I haven’t seen before.
I'm running PBS Pro CE on Ubuntu 18.04.
When I submit large array jobs, the jobs fail to start (or fail to finish) whenever more than one job is running on a node at a time. If I configure the jobs so that each node runs only one at a time, they run fine.
Typically, there are two outcomes:
- multiple pbs_mom processes are spawned, but fail to su to my user account, or
- the pbs_mom processes su to my account, then become defunct
Occasionally these failures are accompanied by mom_log messages stating that stdout/stderr files couldn’t be opened in the working directory.
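In case it helps anyone check for the same symptoms, here is a rough sketch of how the defunct processes can be listed on an execution node. It only uses standard `ps` and `awk`; the function names `filter_defunct` and `list_defunct` are my own placeholders and not part of PBS.

```shell
#!/bin/sh
# Keep only lines whose third field (process state) starts with "Z",
# which ps uses for zombie/defunct processes.
# Expected input columns: pid ppid stat user comm
filter_defunct() {
    awk '$3 ~ /^Z/ { print $1, $2, $4, $5 }'
}

# List defunct processes on this node as: pid ppid user command.
# Separate -o options with empty headers suppress the header line.
list_defunct() {
    ps -e -o pid= -o ppid= -o stat= -o user= -o comm= | filter_defunct
}
```

Running `list_defunct` on a node while the array job is active should show whether the defunct entries are `pbs_mom` children and which parent PID they belong to.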
Has anyone else experienced this error? Any clues on how to diagnose it?