I talked with my colleague about the possibilities.
Not necessarily. Anyway… I want to ask how the multicast improves performance. As far as I can see, the multicast spares only the dis* layer, and in tpp the data are still sent via the common stream, right? Am I missing something?
It is probably a half-hearted solution. My colleague was not comfortable ignoring it. Since we want to provide a secure stream, we should not allow reading unsecured messages containing important data, and I agree that this makes sense. If we were able to determine that the cleartext data are a ping or something similarly unimportant, then this would still be an option, but I am afraid identifying unimportant, unsecured data across layers would be quite ugly.
My colleague likes the option. In that case, wouldn't it be better to revert the dis_gss changes and go back and change the tcp layer instead of dis?
It sounds good. For me, it sounds like the most suitable solution right now. But why do we need to give up multicast for join_job? Join_job conveys only information about the job structure, so I think we can use multicast for it.
I looked deeper at the tpp code: the client annotates the multicast packet with all destination stream addresses while sending it, but the actual copies are made by pbs_comm (the receiver sees the packet as if it were a usual unicast one). I think the largest performance benefit stems from reduced blocking when sending to flaky nodes or nodes behind a lossy network connection.
Some interface should exist that allows the routines interpreting the messages (the cases in im_request, dispatch_request) to retrieve the security properties of an individual channel (as per 4.), unless everything is secured (in the latter case, the existing rpp_getaddr or a similar name-based interface would probably be sufficient).
comm would not need to know user credentials, only host credential names (derived from hostnames). To simplify the implementation (in exchange for CPU cycles spent on re-encrypting), the client<->comm TCP connections could be protected instead of individual RPP streams…
In very large clusters, many pbs_comm's are installed. For example, each rack of compute nodes can be handled by a different pbs_comm. The sender sends the multicast packet only to the pbs_comm to which it is attached. So:
a) The sender does not have to loop through a large number of serial sends (imagine sending the same join_job packet to 20,000 sister moms).
b) The first comm can distribute it down to another set of comm's that are the handlers of the target moms. This is a sort of fan-out protocol, like a tree-based overlay network.
Thus, for pings, hook sends, or join_jobs, sending to tens of thousands of moms happens in parallel and is very fast.
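To make the send-count savings concrete, here is a minimal model of the fan-out (the function names and the grouping-by-rack math are illustrative, not the actual TPP API):

```c
#include <assert.h>

/* Hypothetical model of the TPP multicast fan-out.  The sender hands one
 * annotated packet to its pbs_comm; each comm forwards one copy per
 * downstream comm, and the last hop expands it into unicast packets. */

/* Sends performed by the sender itself: always 1 with multicast,
 * n_dests with plain serial unicast. */
static int sender_sends_multicast(int n_dests) { (void)n_dests; return 1; }
static int sender_sends_unicast(int n_dests)   { return n_dests; }

/* Copies made by the first comm: one per downstream (e.g. rack-leader)
 * comm, assuming moms are spread evenly. */
static int first_comm_copies(int n_dests, int moms_per_comm)
{
    return (n_dests + moms_per_comm - 1) / moms_per_comm; /* ceiling */
}
```

With 20,000 moms and 100 moms per rack comm, the sender performs 1 send instead of 20,000, and the first comm forwards only 200 copies; the leaf expansion then runs in parallel across the rack comms.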
I understand and agree.
Yes. If you like this, then we do not need the dis_gss changes; we should make the changes in the TPP layer instead. This is how the munge authentication works (though that does not involve encryption).
Okay - if we are fine with sending join_jobs without encryption, we can keep another pair of rpp (tpp) connections for join_jobs as well. However, the current code always assumes there is only one rpp/tpp connection between any two parties, so we need to work around this assumption carefully.
You are correct. The sender sends only one packet for a multicast to any number of moms, so it does not have to loop through and send serially. Additionally, the comm can also send it out to multiple other comms that handle the various receivers. The last comm is the one that breaks the multicast packet down into normal unicast packets, so the receiver only sees a "normal" packet. Thus the distribution can happen in parallel, like a fan-out protocol, and is therefore very fast (when the site uses multiple comms). For large sites with 10,000+ moms, a pbs_comm can be configured to handle a set of moms, say at the rack leader, handling the moms that are part of that rack. In that case there is a significant benefit from the parallel distribution. Even in the case of a single pbs_comm, since the sender (mom or server) is typically single-threaded, it benefits from not having to loop serially over a large number of receiver streams, pumping the same packet to each of them individually; pbs_comm does that on its behalf.
Yes, I agree. If we go that route, then we need to add these interfaces.
Yes, this is what we do in our munge authentication (though that is authentication only and not encryption). So, yes that could be one easy route to take and does not need user credentials, but only host credentials.
Well, rpp/tpp is actually a packet protocol, not stream-oriented, despite its endpoint being called a "stream" and its rather unusual interface; adding a separate endpoint on the receive side would not help much anyway.
Regarding interfaces, I think it would be sufficient and simple enough to add an interface like dis_message_properties(stream), which would return bitmasks like RPP_MULTICAST|GSS_INSECURE (ordinary multicast) or GSS_AUTHENTICATED|GSS_PRIVATE (authenticated & encrypted) for an incoming/not-yet-committed message, and to modify im_request() & co. to check the requirements of the received message type.
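A minimal sketch of the idea (the flag values and the helper name are illustrative, not existing PBS code):

```c
#include <assert.h>

/* Illustrative property bits, as dis_message_properties(stream) might
 * return them for the current incoming message. */
#define RPP_MULTICAST       0x01
#define GSS_INSECURE        0x02
#define GSS_AUTHENTICATED   0x04
#define GSS_PRIVATE         0x08  /* encrypted */

/* im_request() & co. could then gate each message type: a sensitive
 * message must arrive authenticated and encrypted, while an
 * unimportant one (e.g. a ping) is accepted regardless. */
static int message_allowed(int props, int needs_privacy)
{
    if (!needs_privacy)
        return 1;
    return (props & GSS_AUTHENTICATED) && (props & GSS_PRIVATE);
}
```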
We can accept unsecured IM_JOIN_JOB in the first (incomplete) implementation stage (current GSS at the dis layer only) and tighten the check later, after comm is modified to re-encrypt multicasts while making copies for leafs or other routers. (I would not declare unsecured IM_JOIN_JOB good enough by design, though.) No performance would be sacrificed this way.
There is another issue: tpp_em_pwait() looks at raw packets, so it can signal that incoming data is ready while it is actually internal service traffic of the GSS or a similar layer that does not result in any upper-layer bytes. This can be handled easily by adding another cn_func-like function (e.g. cn_ready_func) to struct connection (a new add_conn parameter) for checking whether anything is really ready after GSS handshake/unwrap processing. process_socket() would not call cn_func when cn_ready_func returns false.
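Roughly like this (struct layout and names follow the discussion but are an illustrative sketch, not the actual PBS definitions):

```c
#include <stddef.h>

/* Sketch of struct connection with the proposed cn_ready_func slot. */
struct connection {
    int  sock;
    void (*cn_func)(int sock);        /* existing request handler */
    int  (*cn_ready_func)(int sock);  /* new: upper-layer data ready? */
};

static int handled;                   /* instrumentation for the demo */
static void process_request_stub(int sock) { (void)sock; handled++; }

/* During the GSS handshake the traffic yields no upper-layer bytes,
 * so the ready check reports false; after unwrap it reports true. */
static int gss_handshaking(int sock) { (void)sock; return 0; }
static int gss_data_ready(int sock)  { (void)sock; return 1; }

/* process_socket() calls cn_func only when cn_ready_func (if set)
 * says real data survived the GSS handshake/unwrap processing. */
static void process_socket(struct connection *cn)
{
    if (cn->cn_ready_func && !cn->cn_ready_func(cn->sock))
        return;  /* only internal GSS service traffic so far */
    cn->cn_func(cn->sock);
}
```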
Yes, RPP/TPP are not stream-oriented, though they are more so semantically. I understand what you mean.
Yes, we can fairly easily add these interfaces as well to check the type of the message returned, and in the initial version keep IM_JOIN_JOB unencrypted.
Yes, we can add another handler, so that it checks whether actual data is ready and only then calls process_request() etc.
Now, of course, none of the above will be required if we implement the protocol up to the comm only - i.e., if we decrypt and re-encrypt individual packets at the comm. So, if you do this, it seems we will throw away all the above work, right?
I think I am open either way - what do you want to do?
The cn_ready_func part is necessary in all variants in order to handle TCP connections (user<->server, and daemon<->comm if it is encrypted as a unit) without ugly hacks that violate protocol layers. The RPP interface additions would not be useful for full comm connection encryption.
I slightly prefer the full comm connection encryption variant: it is simpler (no need to modify TPP protocol internal processing, simpler potential failure modes), and its main disadvantage, i.e. repeated encryption at each comm hop, does not seem very important in practice (a large installation can afford fast processors and/or multiple comms, and no really large data is currently sent through RPP, only control information and resource values).
I will add cn_ready_func() and move the server part of the TCP handshake into it; PBS_BATCH_AuthExternal will no longer handle the handshake, it will just inform the server that the TCP connection is authenticated.
I will also start the work on comm connection encryption. Since we are in threads, the handshake will be a tight loop at this layer (I don't see another option anyway).
On the TPP encryption side, can we use/extend the TPP_CTL_AUTH part? We use that for the munge authentication; however, that is straightforward and needs no more than one exchange. For gss encryption, we could add another subheader to TPP_CTL_AUTH and put some "STAGE" in it, just like the server and clients do. Then we would not need a tight handshake loop?
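Something along these lines; the subheader layout and stage values here are purely illustrative, not the actual TPP wire format:

```c
/* Sketch of carrying the multi-round GSS handshake in TPP_CTL_AUTH
 * messages via a stage field, instead of a blocking tight loop. */
enum auth_stage {
    AUTH_STAGE_INIT = 0,   /* client sends the first GSS token */
    AUTH_STAGE_CONTINUE,   /* more token exchanges still needed */
    AUTH_STAGE_DONE        /* gss context established */
};

struct tpp_auth_subhdr {
    unsigned char type;      /* would be TPP_CTL_AUTH */
    unsigned char stage;     /* enum auth_stage */
    unsigned int  token_len; /* length of the GSS token that follows */
};

/* Each incoming TPP_CTL_AUTH message advances the handshake one step;
 * rounds_left stands in for GSS_S_CONTINUE_NEEDED from the real
 * library.  The exchange stays asynchronous: no thread blocks waiting
 * for the peer's next token. */
static unsigned char next_stage(unsigned char stage, int rounds_left)
{
    if (stage == AUTH_STAGE_DONE)
        return AUTH_STAGE_DONE;
    return rounds_left > 0 ? AUTH_STAGE_CONTINUE : AUTH_STAGE_DONE;
}
```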
Awesome, thanks! Another question: we will need to do the handshake only when the leaf daemons connect to comm, right? TPP is designed to use persistent TCP connections, so unless moms/the server/the network go down, the connections are set up only once…
I think so; we only need to establish the gss context on connecting to comm. After successful establishment, the gss context remains valid until the mom/server/comm goes down.
Concerning the TPP_CTL_AUTH part, we could use the handlers get_ext_auth_data() and validate_ext_auth_data() as with munge. It should be possible to postpone TPP_CTL_JOIN until the handshake is finished. The issue with this solution is that the GSS layer would be mixed with the tpp layer.
The second solution would be to replace the handlers set in tpp_transport_set_handlers() with gss_handler*() functions, which would in turn call the leaf*() or router*() handlers. This way the GSS layer is nicely isolated in its own layer, which I prefer.
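The wrapping pattern in the second option would look roughly like this (handler signature and names are a hypothetical simplification of the tpp transport handlers, and the fixed 8-byte "wrap header" stands in for the real gss_unwrap() call):

```c
/* A simplified post-receive handler type, as the tpp transport might
 * call after a packet arrives. */
typedef int (*post_recv_handler)(int tfd, char *data, int len);

/* The original leaf handler; here it just reports how many plaintext
 * bytes it was given. */
static int leaf_pkt_postrecv(int tfd, char *data, int len)
{
    (void)tfd; (void)data;
    return len;
}

/* The slot that tpp_transport_set_handlers() would normally point at
 * the leaf/router handler now holds the inner handler for the GSS
 * wrapper to delegate to. */
static post_recv_handler inner_handler = leaf_pkt_postrecv;

/* GSS wrapper installed in place of the original handler: it would
 * call gss_unwrap() here (modeled as stripping an 8-byte wrap header)
 * and then hand the plaintext to the original handler untouched. */
static int gss_pkt_postrecv(int tfd, char *data, int len)
{
    int plain_len = len - 8;  /* stand-in for gss_unwrap() */
    return inner_handler(tfd, data, plain_len);
}
```

The appeal of this layering is that the leaf/router code stays completely unaware of GSS; disabling encryption just means installing the original handlers directly.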
I am back with some news. I have done the implementation as agreed. The new code can be found on the branch kerberos_support_3, and the design doc was also updated. The following was done:
The GSS code was unified and generalized, and redundant code was removed. The same GSS code is used for both TCP and TPP.
The TPP encryption was fully changed as agreed. The connection with comm is now encrypted, and it should also work between routers - anything that connects to comm (server, moms, scheduler, other comms) will use encryption, so you need to have a host keytab on those nodes. With GSS code enabled, it is now forbidden to use cleartext with comm. The implementation replaces the regular tpp handlers with new gss_* handlers, and the gss_* handlers call the regular leaf or router handlers. The asynchronous handshake is always expected at the beginning of communication.
TCP was improved. If the client wants to connect to the server with encryption, the auth batch request is sent, which initiates the handshake. The new cn_ready_func notices that a handshake is in progress and processes the handshake tokens asynchronously. Once the handshake is finished, cn_ready_func returns true (after unwrapping data) and the data are processed by the regular process_request(). The GSS layer is isolated in its own layer here too: the dis_* handlers are replaced with gss_dis_* handlers, and the interface was extended as needed (e.g. tcp_read was exposed to the gss_dis_* layer via a new handler).
The tool for renewing credentials, "renew-test", was added to the unsupported directory.
Miscellaneous improvements.
TCP still allows cleartext, which means it is possible to use regular clients with a GSS-enabled server. This is nice, and it also means you can move a job between a regular server and a GSS-enabled server. Peer scheduling should also work. Adding encryption on the TCP connection between server and scheduler should be quite easy now, but let's keep it as a TODO for a future commit.
It is also possible to enable encryption from hooks; the code is essentially ready for it and it actually already works - pbs_python just needs a valid Kerberos ticket in the default location. Let's keep this as a TODO too, because hooks can also run as users, which should (maybe) be addressed with proper user credentials.
I am quite happy with the changes. Let me know what you think. I am ready to address further comments.
Hi @vchlum, I looked through the design changes and the code as well.
Many thanks - this is indeed a huge improvement, and quite exciting to me. The way it is structured now, I think it will be much simpler for us to add TLS, for example. I like the layering of the dis_gss* functions and the gss_tpp_handlers. The negotiations are asynchronous all across, which is great, and communication via comm is totally encrypted, end to end.
I think you are very close to raising a PR. I can't think of anything else to ask you right now, and I agree completely with the items you mentioned as TODOs for the near future.
Thank you @subhasisb for looking into it. I am very happy to be closer to a PR.
I will go through the code again carefully and try to find what can be improved/cleaned/commented/… I am quite happy with the code, and if you (or anybody else, of course) do not have major comments, I assume no major changes are needed.
@vchlum I do not see any major changes needed in the code - after cleanup you may raise the PR. Then we can start the detailed code review, which, of course, will take a bit of time, but that is usual.
Hi,
Since the Kerberos feature is merged (thank you), I have started working on automated tests, and I have some more ideas about what to do next that I would like to share with you:
I started work on the tests. As phase one, I would like to add Kerberos builds to Travis CI, but I realized there are only 5 concurrent runs available now. Is it OK to add new runs? My goal is to eventually have a smoke test with Kerberos support in Travis (ideally for both MIT and Heimdal). Another possibility would be to add only building with Kerberos support, in order to keep the two extra runs short. What do you think? …or just let me know if the Kerberos builds are not suitable for Travis CI because of the 5-concurrent-runs limit.
If the server is not available, jobs' credentials are not renewed right now, and if the unavailability of the PBS server lingers for a significant period of time (this is already configurable), the jobs could fail. Kerberos allows issuing 'renewable tickets'. The idea is to keep renewing a renewable ticket on the mom for as long as renewing is allowed (usually a few days at most). If the ticket is not renewable, this has no effect, of course. If the server is available, this feature may also have no effect, because the ticket will be renewed in time anyway (depending on configuration). This will increase the robustness of credential renewal.
The renew tool demands new credentials every time no suitable cached credentials are available. If renewing fails for a user, the credentials are demanded again for another job and will likely fail again. The new idea is to demand credentials for a given user at most once per renew-check period (currently 5 minutes): if the demand fails for some user, no further demands are made for that user for the next 5 minutes. This will eliminate unnecessary calls of the renew tool. Since the tool can, e.g., access the KDC directly, this helps reduce the load on the KDC caused by failing demands; and if the demands time out for some reason, it helps prevent the server from getting stuck in renew timeouts.
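The per-user backoff could be as simple as remembering the time of the last failed demand per user; this is a sketch under assumed names and a hard-coded 5-minute period, not the actual renew-tool code:

```c
#include <string.h>

/* Illustrative per-user backoff table for failed credential demands. */
#define RENEW_BACKOFF_SECS (5 * 60)
#define MAX_USERS 64

struct failed_demand {
    char user[64];
    long failed_at;            /* time of the last failed demand */
};

static struct failed_demand failures[MAX_USERS];
static int nfailures;

/* Remember (or refresh) when a demand last failed for this user. */
static void record_failure(const char *user, long now)
{
    for (int i = 0; i < nfailures; i++)
        if (strcmp(failures[i].user, user) == 0) {
            failures[i].failed_at = now;
            return;
        }
    if (nfailures < MAX_USERS) {
        strncpy(failures[nfailures].user, user,
                sizeof failures[0].user - 1);
        failures[nfailures].failed_at = now;
        nfailures++;
    }
}

/* Demand credentials only if this user has no failure within the
 * backoff period; otherwise skip the call to the renew tool. */
static int should_demand(const char *user, long now)
{
    for (int i = 0; i < nfailures; i++)
        if (strcmp(failures[i].user, user) == 0)
            return now - failures[i].failed_at >= RENEW_BACKOFF_SECS;
    return 1;
}
```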
These features do not require adding new interfaces so far. If you have any comments, I would be happy to discuss them.