Excessive execution time of programs in NFS directories

I have three compute nodes that mount a storage node via NFS. Working in the same NFS-shared root directory, I run Abaqus on the same .inp file in two ways: with a 24+24 CPU combination across two nodes, the job takes 2h32m11s; with 48 CPUs on the third node, started at the same time, the same .inp file takes 6h18m11s.

This is ridiculously slow. How can I resolve this issue?

Abaqus performance depends heavily on the solver type (explicit or implicit) and on the underlying hardware, since the solvers are memory-, I/O-, and CPU-intensive. For best results, use local SSDs as scratch storage and, if needed, cross-mount them across nodes. NFS-mounted directories typically degrade performance due to network bandwidth limits, read/write latency, and I/O congestion.
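
For example, here is a minimal sketch of pointing the Abaqus scratch directory at a node-local SSD via the environment file; the /scratch/abaqus path and the memory value are placeholders for illustration, not recommendations for your machines:

```
# abaqus_v6.env -- Abaqus reads this Python-syntax environment file from the
# job directory (or the user's home directory) at startup.
# The path and memory value below are placeholders; adjust to your nodes.
scratch = "/scratch/abaqus"   # node-local SSD instead of the NFS share
memory = "90 %"               # let the solver use most of the node's RAM
```

The same scratch setting can usually also be passed on the abaqus command line as scratch= if you prefer not to edit the environment file.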

If a high-speed InfiniBand (IB) network is available, cross-mounting SSDs over it can improve throughput. Otherwise, it’s generally better to run the job on a single node—especially when you know that the model does not scale efficiently across multiple nodes.

Thank you very much for your help; it’s very helpful. But what’s the purpose of “cross-mount them across nodes”? Doesn’t it also use NFS for mounting? It’s still an NFS shared directory. Wouldn’t that cause the same problem?

Yes, correct, but there is a difference. With a local-disk cross-mount there are multiple NFS servers (each node exports its own local disk), so the load is distributed, and the I/O goes through a given node's NFS mount if and only if that node is involved in the computation.

  • Your setup has a single NFS server (a global NFS server, let's say); all the traffic goes through this one server, which is the bottleneck, and the I/O load is not distributed.

It might run into the same problem beyond a certain load threshold, but it is much better than having one NFS server.

Hence shared, parallel, and distributed file systems are used along with a fast interconnect: NFS, BeeGFS, Lustre, GlusterFS, PanFS, WekaFS, GPFS (PixStor); some are open source and some are commercial offerings.

If your question is why the single node case takes so much longer than the two node case, I can think of a couple reasons:

  • How many physical cores does each node have? If the single-node CPUs are really just hyperthreads, rather than separate cores, then such a slow-down is common (see the sketch after this list for a quick way to check).
  • Two nodes give you double the capacity of a single node: twice as much CPU cache, twice as much memory bandwidth, twice as much application memory before swapping becomes necessary, twice as much file system buffer cache, and twice as much network bandwidth to the NFS server.
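
To check the first point, here is a rough Linux-only sketch that counts physical cores versus logical CPUs by parsing /proc/cpuinfo (it assumes the usual x86 layout of that file; nothing Abaqus-specific):

```
# count_cores.py -- distinguish physical cores from hyperthreads on Linux by
# counting unique (physical id, core id) pairs in /proc/cpuinfo.
def core_counts():
    physical = set()
    logical = 0
    cpu = {}
    with open("/proc/cpuinfo") as f:
        for line in f:
            line = line.strip()
            if not line:                 # a blank line ends one CPU entry
                if cpu:
                    logical += 1
                    physical.add((cpu.get("physical id"), cpu.get("core id")))
                cpu = {}
            elif ":" in line:
                key, _, value = line.partition(":")
                cpu[key.strip()] = value.strip()
    if cpu:                              # last entry if no trailing blank line
        logical += 1
        physical.add((cpu.get("physical id"), cpu.get("core id")))
    return len(physical), logical

if __name__ == "__main__":
    phys, logi = core_counts()
    print(f"physical cores: {phys}, logical CPUs: {logi}")
```

If the logical count is twice the physical count, half of the "48 CPUs" on a node are hyperthreads rather than independent cores.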

With only a single network's worth of bandwidth, such as a gigabit network, even if the NFS server mounts were distributed, performance should not improve much when all the nodes are executing jobs.

Each node has 24 physical cores with hyperthreading enabled. How should I understand the phrase “twice as much network bandwidth to the NFS server”? I only have one gigabit network connection.

What’s even more surprising is that the two jobs started executing at the same time, but the job running on the two nodes completed over two hours later. During this time it was not competing with the job still running on the single node, yet the single-node job ultimately took an unreasonable amount of time to complete.

The NFS server uses HDDs, and the compute nodes also use HDDs. This does reduce performance, but I can’t understand why a single node could take over six hours to complete.

Please check on these:

  • Cores vs. threads, and whether multiple jobs were running on the same node when your test job ran, as mentioned above by @dtalcott.
  • The specification of the system.
  • The specification of the network.
  • The specification of the input deck (input file): number of elements, displacement, stress, strain, fatigue, and whether the solver is explicit or implicit. We have seen some input decks run just fine and others run slowly because of the element types and their specification.
  • The scalability of the application.
  • I think you can run a pre-run check in Abaqus to get the prerequisites ready; see the sketch after this list.
  • Also, you may want to get some advice from the vendor about the best configuration for the cluster setup.
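
On the pre-run check mentioned in the list: Abaqus has a datacheck execution option that validates the input deck and environment without running the full analysis. Below is a small sketch that launches it from Python; the job and input-file names are placeholders, and it assumes the abaqus command is on the PATH:

```
# datacheck.py -- run an Abaqus data check before the full analysis to catch
# input-deck and environment problems early. Names below are placeholders.
import subprocess

subprocess.run(
    [
        "abaqus",
        "job=pretest",       # placeholder job name
        "input=model.inp",   # placeholder input deck
        "datacheck",         # data check only, no analysis
        "interactive",       # stay attached so messages appear on the console
    ],
    check=True,
)
```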

You are right. With only the one network, using two nodes won’t give you twice the bandwidth. It might be better than one node or might be worse, depending on your I/O patterns.
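
If you want to see where the I/O bottleneck actually is, here is a simple sequential-write throughput sketch; the paths and sizes are placeholders, and it only measures streaming writes, not the small random I/O a solver may do:

```
# iotest.py -- rough sequential-write throughput test. Run it once against a
# file on the NFS share and once against a node-local disk and compare.
import os, sys, time

def write_test(path, size_mb=1024, block_mb=4):
    block = os.urandom(block_mb * 1024 * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())       # force the data out to the server/disk
    elapsed = time.time() - start
    os.remove(path)
    return size_mb / elapsed       # MB/s

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "./iotest.tmp"
    print(f"{target}: {write_test(target):.1f} MB/s sequential write")
```

On a single gigabit link the NFS number cannot exceed roughly 110-120 MB/s, and it drops further when several nodes write at the same time.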

Now I’m puzzled. In your first post, you say the two-node job took 2h32m11s. I assumed this was walltime. Was it really CPU time? If it was CPU time reported by abaqus, then this makes some sense. If abaqus reports just the CPU time of the first node, then the two nodes together took twice that, or about 5 hours. Which is about right compared to the single-node CPU time of 6+ hours, given that busy hyperthreaded CPUs run at about 70 - 80% of the speed of a single thread per core.
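
To make that arithmetic explicit, here is a quick back-of-the-envelope check using the reported times; the 70-80% figure is the rough hyperthreading estimate above, not a measured value:

```
# Rough check: if 2h32m11s is the per-node time of the two-node run, the total
# work is about twice that, and the single-node run should land somewhere near
# total_work / hyperthread_efficiency.
two_node_per_node = 2 * 3600 + 32 * 60 + 11    # 9131 s
single_node       = 6 * 3600 + 18 * 60 + 11    # 22691 s

total_work = 2 * two_node_per_node             # ~18262 s, about 5 hours
for eff in (0.7, 0.8):
    print(f"predicted single-node time at {eff:.0%} efficiency: "
          f"{total_work / eff / 3600:.1f} h")
print(f"observed single-node time: {single_node / 3600:.1f} h")
```

At 80% efficiency the prediction is about 6.3 hours, which closely matches the observed 6h18m.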

Then, the question is why the two-node job finished after the single-node job. My guess (and I am not familiar with abaqus) is that the extra time was due to communication and synchronization between the two nodes. This would not show up in the CPU time, but would affect walltime. However, this extra walltime seems excessive to me for an application designed to run in parallel. So, I think there is something else going on.