I recently installed our new cluster with OpenPBS 22.05.11 on Rocky Linux 8.8. Now I have the problem that commands like qstat sometimes take a long time to finish or even fail with the error
pbs_iff: cannot connect to host
pbs_iff: all reserved ports in use
After investigating for a while I found that this happens when too many connections to the PBS server are in a TIME_WAIT state. So when I run the command
netstat -pant | grep 15001 | grep TIME_WAIT | wc -l
and the number I get here reaches 1024, I can see that the commands are slow. Below that number there is no problem, but I never see the number go above 1024.
So I checked my old cluster, which is running PBS Pro 19.2.4, and found that the same netstat command gives me numbers bigger than 10000. So my question is: is the number of ports or connections for the PBS server somehow limited to 1024? Or is this an issue of the OS?
Any help on this is very welcome!
Ports over 1023 are not reserved. So yes, you have only a limited number of reserved ports. What makes them special is that only root can bind to them (on UNIX/Linux).
To get rid of connections in TIME_WAIT more rapidly, tune the kernel tunable net.ipv4.tcp_fin_timeout. The default value is 60 s, but unless you are communicating over a very slow WAN with weird routes, 10 or even 5 seconds is fine.
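As a rough sketch of the suggestion above, this is how the tunable could be inspected and changed. The helper name fin_timeout and the file name 90-pbs.conf are made up for illustration. Note that the kernel documentation describes net.ipv4.tcp_fin_timeout in terms of the FIN_WAIT_2 state, so it is worth re-running the netstat count afterwards to confirm it actually changes the TIME_WAIT numbers on your system.

```shell
# Print the current value in seconds. The optional path argument is only there
# so the helper can be pointed at a different file (hypothetical helper name).
fin_timeout() {
    cat "${1:-/proc/sys/net/ipv4/tcp_fin_timeout}"
}

# Change it at runtime (needs root):
#   sysctl -w net.ipv4.tcp_fin_timeout=10
#
# Keep it across reboots by putting this line into a file under /etc/sysctl.d/,
# for example /etc/sysctl.d/90-pbs.conf (file name is an assumption):
#   net.ipv4.tcp_fin_timeout = 10
# and then reload with:
#   sysctl --system
```

Calling fin_timeout without arguments prints the value currently in effect.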
Note that some reserved ports might be in use by other daemons. The NIS client is an eager consumer of those too. Consider using nscd or sssd to cache username to uid mapping if those connections in TIME_WAIT are not from PBS…
Thank you for your reply!
I think there is a bit of a misunderstanding. By reserved I don’t mean privileged ports; I am referring to the error message from PBS. My observation from PBS 19 is that there is no shortage of ports. And now, after more investigation, it looks to me like the state of the ports is not the issue but rather which ports PBS uses. On my new installation on Rocky Linux I saw that privileged ports were used as source ports. On the old PBS I saw that only port 15001 is used. I think this behaviour is not correct.
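For anyone who wants to check the same thing, here is a minimal sketch of how the TIME_WAIT connections on the PBS port could be split by whether the other end uses a privileged port (below 1024). The function name classify_time_wait is made up, and it assumes the usual column layout of netstat -ant (local address in column 4, foreign address in column 5, state in column 6).

```shell
# Reads `netstat -ant` style lines on stdin and counts TIME_WAIT connections
# involving the given port (default 15001), grouped by the port on the
# opposite end of the connection.
classify_time_wait() {
    awk -v port="${1:-15001}" '
        $6 == "TIME_WAIT" {
            suffix = ":" port "$"
            if ($5 ~ suffix)      peer = $4   # client side: local port is the source port
            else if ($4 ~ suffix) peer = $5   # server side: foreign port is the source port
            else next                         # connection does not involve the PBS port
            n = split(peer, parts, ":")       # the port is the last colon-separated field
            if (parts[n] + 0 < 1024) priv++; else unpriv++
        }
        END { printf "privileged=%d unprivileged=%d\n", priv, unpriv }
    '
}
```

It would be used as netstat -ant | classify_time_wait 15001. If the privileged count sits near 1024 while the unprivileged count stays at 0, that would match what I saw on the Rocky Linux installation.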
I could get rid of that problem now by using Munge for authentication but I still can’t figure out why PBS worked differently for me on Rocky Linux compared to my old installation on CentOS 7.
If you are interested in more details you can read this post on the OpenPBS GitHub