Kerberos support

vchlum · October 22, 2018, 11:34am

I have started to work on the code cleanup and I have a question. Should we remove the code related to GRIDPROXY and AES too? I am not sure what the state of this code is. Since it is an alternative to Kerberos, I think it is also dead code, isn’t it?

If yes, I think we can remove the general functions related to the credentials - but it deserves a discussion.

Vasek

subhasisb · October 22, 2018, 12:22pm

Hi Vasek,

Thanks for offering to remove the dead code. I think it will be great if you can help with removing the GRIDPROXY cred related stuff. However, I think the AES credtype is being used on the Windows builds and also for pbs_password.c (also inside the database password encryption/decryption) - some time back I added those but used the same header and structure…

Thanks,
Subhasis

vchlum · November 6, 2018, 1:09pm

Hi,

I just want to inform you that I am working on the design document. Should the design document cover all the new info, debug, and error messages?

Vasek

mkaro · November 7, 2018, 3:39pm

Hi Vasek,

The design should cover all of the log messages you use for testing (because PTL becomes a consumer of those messages) and any others you feel should be documented in the admin guide to help troubleshoot when a problem occurs.

Thanks,

Mike

vchlum · November 8, 2018, 7:06am

OK, thank you @mkaro.

May I ask what the status is, @mkaro, @subhasisb? Just a kind question :) Absolutely no rush. I am just not sure whether I should wait for some pre-PR comments or keep working on this and push it forward.

I am aware there is still the krb525 code (hidden by a macro now) on the branch. That is because I want to move it as late as possible to avoid repeatedly resolving conflicts.

I also prepared a solution using engage_external_authentication, but it is on a different branch.

V.

subhasisb · November 8, 2018, 8:49am

Sorry for the delays @vchlum - been a festival month here. Would it be possible to share the one with engage_external_authentication? I can take a peek at that.

vchlum · November 8, 2018, 9:26am

Of course @subhasisb: https://github.com/CESNET/pbspro/tree/kerberos_support_2_ext_auth

Vasek

subhasisb · November 9, 2018, 10:20am

Hi @vchlum, I had an initial look at it and it looks fine so far. A few things I am still trying to grapple with:

I am still not sure why you need the sync byte.
Is the TPP communication (server/mom connecting with comm) also covered by Kerberos? If so, does it work over the engage external authentication like munge?
Where do credential timeout, renewal, etc. happen? Sorry, I got a bit lost in the code.
If it is not too much, can you please squash all your commits together, so that I can see all the changes in one commit? Otherwise it is difficult to shuffle between multiple commits.

Just for easy understanding (beyond external design etc.) it could help if you can describe the control flow for the authentication (if at all possible with your time)…?

vchlum · November 9, 2018, 1:15pm

Hi @subhasisb

The problem is how to switch to wrapped/encrypted communication with data being buffered on the recipient. First, the communication is in cleartext and GSS exchanges messages (tokens) needed to establish GSS context between the server and the client. These messages are in cleartext. Once there is enough data exchanged, the context is established on both sides. And once the context does exist, all new data are wrapped (encrypted) by GSS wrap on both sides. Now, the problem comes…

The ‘reply_ack(request)’ is sent immediately after the last cleartext GSS token (the last token needed for the GSS context on the other side) is sent. The problem is that ‘reply_ack()’ is already encrypted, because the GSS context is established right after sending the last GSS token and right before the ‘reply_ack(request)’ is sent. Let’s move to the recipient… The recipient still needs to receive the last GSS token in cleartext, and now the token is being received… And this is the race: sometimes the ‘reply_ack(request)’ is read and buffered together with the last GSS token in cleartext, and sometimes the last GSS token is read separately, and ‘reply_ack(request)’ is then read correctly in another read with a fully established GSS context.

The sync byte forces the sender to wait until the GSS context is established on the other side before the ‘reply_ack(request)’ is sent. I am not fully satisfied with this solution, but I don’t know how to do it better.
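To make the race above easier to picture, here is a minimal Python sketch. It is not PBS Pro code: `wrap`/`unwrap` are a toy XOR stand-in for GSS wrap/unwrap, and `receive`, `TOKEN`, and `ACK` are hypothetical names. The assumption is that each read is decoded according to whether the context existed when that read started.

```python
# Toy model of the race: NOT real GSS, just an illustration.

TOKEN = b"LAST-GSS-TOKEN"   # stands in for the last cleartext GSS token
ACK = b"reply_ack"          # stands in for the first wrapped message


def wrap(data: bytes) -> bytes:
    """Stand-in for GSS wrap: XOR every byte (not real encryption)."""
    return bytes(b ^ 0x5A for b in data)


unwrap = wrap  # XOR with a constant is its own inverse


def receive(chunks):
    """Decode each read with the mode that was active when the read began."""
    context_established = False
    seen = bytearray()
    for chunk in chunks:
        if context_established:
            seen += unwrap(chunk)          # context exists: unwrap the read
        else:
            seen += chunk                  # still cleartext mode
            if TOKEN in seen:
                context_established = True  # token complete: context is up
    return bytes(seen)


wire = TOKEN + wrap(ACK)

# Case 1: the token and the wrapped ack arrive in two separate reads.
separate = receive([wire[:len(TOKEN)], wire[len(TOKEN):]])

# Case 2: both are buffered by a single read, so the wrapped ack is
# consumed as if it were cleartext.
coalesced = receive([wire])
```

In the first case the receiver ends up with the token followed by a readable ack; in the second case the ack bytes stay wrapped, which is the failure the sync byte works around.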

I am sorry, only TCP is covered by external authentication :( The TPP encryption is done on the RPP stream layer, which is unfortunate for external authentication.

The server is responsible for sending the renewed credentials to jobs in time. Every job has an attribute (credential_validity) holding the validity of its credentials. Please see server/svr_credfunc.c. There is a work task ‘svr_renew_creds’ on the server side, which runs every SVR_RENEW_CREDS_TM seconds. This work task traverses all jobs and checks the validity of their credentials. If the validity of a particular job is due, the ‘svr_renew_job_cred’ task is run, and within it ‘send_cred’ is run. The renewed credentials are obtained and sent to the superior mom. Once the credentials are received on the mom side, they are stored in memory with ‘store_or_update_cred()’. After this, the credentials are sent to the sister moms, the function ‘resmom/renew.c:renew_job_cred()’ is called, and the renewal continues in ‘resmom/renew.c’.
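The server-side part of this flow can be sketched roughly as follows. This is Python pseudocode of the control flow only, not the C implementation: the function names mirror the ones mentioned above, but their bodies, the `Job` class, the `RENEW_MARGIN` threshold, and the one-hour lifetime are all assumptions made for illustration.

```python
from dataclasses import dataclass

RENEW_MARGIN = 600  # hypothetical: renew when less than this many seconds remain


@dataclass
class Job:
    jobid: str
    credential_validity: int  # epoch seconds until which the credentials are valid


def send_cred(job: Job, now: int, lifetime: int = 3600) -> None:
    """Hypothetical stand-in: obtain renewed credentials and send them to the mom."""
    job.credential_validity = now + lifetime


def svr_renew_job_cred(job: Job, now: int) -> None:
    """Per-job task: in this sketch it only delegates to send_cred."""
    send_cred(job, now)


def svr_renew_creds(jobs, now: int):
    """One pass of the periodic work task (rescheduled every
    SVR_RENEW_CREDS_TM seconds in the server). Returns renewed job ids."""
    renewed = []
    for job in jobs:
        if job.credential_validity - now <= RENEW_MARGIN:
            svr_renew_job_cred(job, now)
            renewed.append(job.jobid)
    return renewed


jobs = [Job("1.server", 1000), Job("2.server", 5000)]
renewed = svr_renew_creds(jobs, now=500)
```

Only the first job is within the margin at `now=500`, so only its validity is pushed forward; the mom-side storing and forwarding to sister moms is left out of the sketch.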

OK, everything is squashed on the kerberos_support_2 branch now: https://github.com/CESNET/pbspro/tree/kerberos_support_2
Please check it out again. I usually do the rebase and squash once a week.

Coming in the next post…

vchlum · November 11, 2018, 1:26pm

I have added a new chapter, GSS-API in PBS Pro, to the design. It is a bit long for a post. Please let me know if you find the answer there. It could help with debugging, and from that point of view I think it can be in the design… or I can remove/move it later :).

I suppose you are talking about pbsgss_client_authenticate() and req_gssauthenuser()… This is the part where the credentials are acquired and the GSS handshake is done, i.e. the GSS context is established. After this, the communication is encrypted. This part has similar logic on both TCP and TPP, but the implementation needs to be very different. The reason is that with TCP we have the socket file descriptor available and can read from and write to it directly, but with TPP we are not able to communicate directly ‘client ↔ server’. With TPP, the communication goes through the comm.

This may also make it clearer why the external authentication is used only for TCP, if I am not mistaken.

Vasek

subhasisb · November 20, 2018, 10:24am

vchlum:

The problem is how to switch to wrapped/encrypted communication with data being buffered on the recipient. First, the communication is in cleartext and GSS exchanges messages (tokens) needed to establish GSS context between the server and the client. These messages are in cleartext. Once there is enough data exchanged, the context is established on both sides. And once the context does exist, all new data are wrapped (encrypted) by GSS wrap on both sides. Now, the problem comes…

The ‘reply_ack(request)’ is sent immediately after the last cleartext GSS token (the last token needed for the GSS context on the other side) is sent. The problem is that ‘reply_ack()’ is already encrypted, because the GSS context is established right after sending the last GSS token and right before the ‘reply_ack(request)’ is sent. Let’s move to the recipient… The recipient still needs to receive the last GSS token in cleartext, and now the token is being received… And this is the race: sometimes the ‘reply_ack(request)’ is read and buffered together with the last GSS token in cleartext, and sometimes the last GSS token is read separately, and ‘reply_ack(request)’ is then read correctly in another read with a fully established GSS context.

The sync byte forces the sender to wait until the GSS context is established on the other side before the ‘reply_ack(request)’ is sent. I am not fully satisfied with this solution, but I don’t know how to do it better.

I think a forced DIS_tcp_wflush(sock) call would write the bytes in the current buffer out? That might eliminate the need for the sync byte…

vchlum:

The server is responsible for sending the renewed credentials to jobs in time. Every job has an attribute (credential_validity) holding the validity of its credentials. Please see server/svr_credfunc.c. There is a work task ‘svr_renew_creds’ on the server side, which runs every SVR_RENEW_CREDS_TM seconds. This work task traverses all jobs and checks the validity of their credentials. If the validity of a particular job is due, the ‘svr_renew_job_cred’ task is run, and within it ‘send_cred’ is run. The renewed credentials are obtained and sent to the superior mom. Once the credentials are received on the mom side, they are stored in memory with ‘store_or_update_cred()’. After this, the credentials are sent to the sister moms, the function ‘resmom/renew.c:renew_job_cred()’ is called, and the renewal continues in ‘resmom/renew.c’.

This sounds all complete. Thanks.

Sorry, was OOO the whole of last week - will check it this week.

subhasisb · November 20, 2018, 10:28am

vchlum:

I have added a new chapter, GSS-API in PBS Pro, to the design. It is a bit long for a post. Please let me know if you find the answer there. It could help with debugging, and from that point of view I think it can be in the design… or I can remove/move it later :).

I suppose you are talking about pbsgss_client_authenticate() and req_gssauthenuser()… This is the part where the credentials are acquired and the GSS handshake is done, i.e. the GSS context is established. After this, the communication is encrypted. This part has similar logic on both TCP and TPP, but the implementation needs to be very different. The reason is that with TCP we have the socket file descriptor available and can read from and write to it directly, but with TPP we are not able to communicate directly ‘client ↔ server’. With TPP, the communication goes through the comm.

This actually sounds quite okay. I am assuming the communication via the RPP stream (which goes via TPP and the comm) gets established between server-mom and mom-mom. I will go over your document in detail this week, but it looks quite good at a quick glance. Thanks!

vchlum · November 26, 2018, 8:48am

DIS_tcp_wflush(sock) is called after the last cleartext message and before reply_ack(request).

The problem is on the recipient. The last cleartext message is sometimes received (buffered) together with reply_ack(request).

V.

subhasisb · November 27, 2018, 10:09am

Understood. One more thought, and sorry if I am dragging this out: is it possible to pass a length before the message, so that the receiver knows exactly how many bytes are cleartext and where the encrypted data starts?

subhasisb · November 27, 2018, 10:14am

@vchlum your design document looks good to me.

Are you planning to add some test procedures (setup and test cases) at some point? That could help the rest of the community get started with using this feature.

Thanks,
Subhasis

vchlum · November 27, 2018, 12:16pm

Your idea is good. Thanks. It seems to be easy now :) Actually, the length of the GSS token needed for establishing the GSS context is already sent in front of the token itself (which is also the last cleartext message). The solution should be to modify tcp_read() so that it can read only a limited length. Working on it.
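A rough sketch of the "read only a limited length" idea, in Python rather than the actual C code. The 4-byte big-endian length prefix and the helper names `recv_exact` and `read_token` are assumptions for illustration; the real wire format of the token length is whatever the branch already sends.

```python
import socket
import struct


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket, never more."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))  # ask only for what is still missing
        if not chunk:
            raise ConnectionError("peer closed before all bytes arrived")
        buf += chunk
    return bytes(buf)


def read_token(sock: socket.socket) -> bytes:
    """Read a length prefix (assumed 4 bytes, big-endian), then the token."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


a, b = socket.socketpair()
token = b"LAST-GSS-TOKEN"
encrypted_ack = b"\x13\x37\x00\xff"  # opaque bytes standing in for a wrapped reply_ack

# The sender pushes the cleartext token and the wrapped ack back to back.
a.sendall(struct.pack("!I", len(token)) + token + encrypted_ack)

received_token = read_token(b)                 # stops right after the token
leftover = recv_exact(b, len(encrypted_ack))   # wrapped bytes are still unread

a.close()
b.close()
```

Because the reader never asks for more than the announced length, the wrapped bytes stay in the socket until the context is established and they can be read in wrapped mode.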

vchlum · November 27, 2018, 12:36pm

Yes, I certainly want to provide tests for the Kerberos feature, but I have not started working on the tests yet. Since this is pretty complex, should the new tests be part of the first merge, or is it reasonable to provide the tests later? We will need to build the whole Kerberos world in the test scripts. We will also need an external tool for providing credentials in the tests. Is it OK to build and use our tool in the scripts for now?

Vasek

subhasisb · November 28, 2018, 5:40am

Hi Vasek,

I feel, given the size of the work, it is okay to develop the automated test scripts later (as a separate PR/commit). However, since you are testing the changes against Kerberos anyway, a basic text document detailing how you set up Kerberos, plus some manual tests, would benefit the maintainers as well as the community. Does that sound workable?

Thanks,
Subhasis

subhasisb · November 28, 2018, 5:42am

Perhaps you do not even need to modify tcp_read() to include a length. Usually the DIS_read routines can read exactly the required number of bytes from the already received buffer (which tcp_read() would read and keep in its internal buffers). So in this case, you could code the particular DIS_xxx routine to read only a specific set of bytes, and no more? Which routine is reading this data? Perhaps I can take a look.
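Something along these lines, as a Python sketch of the idea rather than the actual DIS code. `ReadBuffer`, `NeedMoreData`, and the 4-byte big-endian length are hypothetical: the buffer plays the role of the internal buffer that tcp_read() fills, and `take()` plays the role of a DIS routine that consumes an exact byte count.

```python
class NeedMoreData(Exception):
    """Raised when fewer bytes are buffered than the caller asked for."""


class ReadBuffer:
    """Hypothetical stand-in for the internal receive buffer."""

    def __init__(self) -> None:
        self._data = bytearray()

    def fill(self, chunk: bytes) -> None:
        """Append whatever one low-level read returned, possibly too much."""
        self._data += chunk

    def take(self, n: int) -> bytes:
        """Consume exactly n bytes and leave the rest buffered."""
        if len(self._data) < n:
            raise NeedMoreData(f"need {n} bytes, have {len(self._data)}")
        out = bytes(self._data[:n])
        del self._data[:n]
        return out

    def pending(self) -> int:
        """Number of bytes still buffered and not yet consumed."""
        return len(self._data)


buf = ReadBuffer()
token = b"LAST-GSS-TOKEN"
wrapped_ack = b"\xde\xad\xbe\xef"  # opaque bytes standing in for a wrapped reply_ack

# One read delivered the length, the token, and the wrapped ack together.
buf.fill(len(token).to_bytes(4, "big") + token + wrapped_ack)

length = int.from_bytes(buf.take(4), "big")
cleartext_token = buf.take(length)   # consumes the token and nothing else
remaining = buf.pending()            # wrapped bytes are still in the buffer
```

The over-read still happens at the socket level, but since the consumer takes only the announced number of bytes, the wrapped remainder can be unwrapped after the context is established.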

vchlum · November 28, 2018, 1:29pm

OK, sounds good to me.

V.
