-
Notifications
You must be signed in to change notification settings - Fork 146
feat: implements crl support in zTunnel #1660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Skipping CI for Draft Pull Request. |
|
😊 Welcome @nilekhc! This is either your first contribution to the Istio ztunnel repo, or it's been You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines Thanks for contributing! Courtesy of your friendly welcome wagon. |
|
/test all |
|
I'm a bit confused. It seems there was a design doc that was approved for this (not sure by who, as I was out of office and there are no meeting notes or listed approvers) but the design is completely different than this |
6cfcad3 to
c522f5f
Compare
|
/test all |
Hi @howardjohn, Thanks for the comment. The original design I presented in one of the community meetings received some valuable feedback. Based on that, I pivoted the design to propagate CRL information through a ConfigMap, similar to how the ca-root-cert is propogated. I’ve captured these updates in the same design document, under the Implementation section. This PR simply uses that CRL ConfigMap and validates the request accordingly. |
c522f5f to
ce5886a
Compare
|
/test all |
ce5886a to
2402c4d
Compare
|
/test all |
| socket2::SockRef::from(&s).set_tcp_keepalive(&ka) | ||
| ); | ||
| } | ||
| #[cfg(target_os = "linux")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was getting an error on macos due to cross compilation. I have to add this to some places to fix it. Please let me know if there is a better way to handle this.
|
/retest |
src/proxy/connection_manager.rs
Outdated
|
|
||
| /// Close all inbound and outbound connections | ||
| /// This is called when certificate revocations are detected | ||
| pub fn close_all_connections(&self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems really dangerous and bad behavior. How strongly do you feel about keeping this?
Does any other application do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 I don't think we should be closing every single connection just because of one revocation. What does Envoy do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason I addd this since http connections are long lived and does not check updated crl. Is there a better way to handle this? I know envoy has some timeout we can configure. Not comepletly sure how zTunnel handles it. Happy to tweak this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intuition is to do nothing. But at the very least, only close ones that are revoked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing nothing feels wrong IMO; the purpose of a CRL is to be able to break a particular trust chain, and you don't want existing connections to continue. Only closing the ones that are revoked is the right solution to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is restart your clients if you want the CRL to take effect quickly the UX we want then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo, zTunnel should terminate that connection selectively and avoid restarting the workload. Since the user provides/updates the CRL file, they’re already aware of potentially revoked certificates, so why ask them to take action twice?
(1) Create or update the CRL, and
(2) Restart the workload pod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Playing advocate aside, I suspect that most folks would accept a limitation where existing connections are not re-evaluated on a CRL refresh. This seems to be a fairly common behavior. I think I would lean towards breaking support for re-evaluating existing connections into a separate PR (assuming we do want to explore that functionality).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate PR seems reasonable to me so we can evaluate the complexity on its own
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reverted the outbound related changes and now it only selectively terminates inbound connections. PTAL. Thanks
If you'd like to see the conversation where this was reviewed, it's in the 4/14 wg meeting at https://www.youtube.com/watch?v=welIiuqUvn0&t=118s |
5509cf6 to
37f0ca0
Compare
11a8b0b to
0f855bf
Compare
|
@howardjohn @ilrudie @keithmattix Could you PTAL when you get a chance? Thanks. |
0f855bf to
01540f7
Compare
ilrudie
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me this bundles functionality that belongs in something like PolicyWatcher into the ConnectionManager. Please take a look at that, specifically how run works and is invoked and works for an example of how similar functionality was implemented.
src/proxy/inbound.rs
Outdated
| let rbac_ctx = ProxyRbacContext { | ||
| conn: conn.clone(), | ||
| dest_workload: destination_workload, | ||
| }; | ||
| let mut tls_guard = pi | ||
| .connection_manager | ||
| .register_tls_connection(&rbac_ctx, client_cert_serials.clone()); | ||
| debug!("TLS connection registered for revocation tracking"); | ||
|
|
||
| // Get watcher BEFORE serving to listen for revocation-based termination | ||
| let tls_drain_watch = tls_guard.watcher(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
serv_connect will already create a connection guard which will end the connection when asked. I don't think we need a second connection guard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a commit didn't get pushed. We still have a second connection guard being created
src/proxy/inbound.rs
Outdated
| // This allows CRL revocation to terminate the connection immediately | ||
| tokio::select! { | ||
| result = serve => { | ||
| tls_guard.release(); | ||
| result | ||
| } | ||
| _ = tls_drain_watch.wait_for_drain() => { | ||
| debug!("TLS connection terminated due to certificate revocation"); | ||
| tls_guard.release(); | ||
| // Return an error to signal abnormal termination | ||
| // This helps with proper cleanup and error reporting | ||
| Err(std::io::Error::new( | ||
| std::io::ErrorKind::ConnectionAborted, | ||
| "connection terminated due to certificate revocation" | ||
| ).into()) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to above, we already take a guard that closes the connection, I don't think we need
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored
Thanks for the pointers @ilrudie. I refactored the code with CRLWatcher much like PolicyWatcher. PTAL Thanks. |
9d7986a to
2071f5f
Compare
|
Local testing - Rejecting exisiting connection: 2025-12-02T00:27:53.591237Z warn tls::crl detected 1 NEW certificate revocation(s)
2025-12-02T00:27:53.591338Z warn tls::crl newly revoked serial: [123, 235, 92, 169, 143, 75, 240, 223, 161, 240, 41, 122, 127, 234, 128, 208, 103, 54, 93, 192]
2025-12-02T00:27:53.591387Z info tls::crl NEW REVOCATIONS DETECTED - CRL watcher will handle connection closures
2025-12-02T00:27:53.591552Z info proxy::h2::server:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k} starting graceful drain (TLS connection certificate revoked)
2025-12-02T00:27:53.592162Z info tls::crl_watcher:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k} closed connection 10.244.0.8:46424(spiffe://cluster.local/ns/crl-test/sa/sleep)->10.244.0.9:15008 (Kubernetes//Pod/crl-test/httpbin-5ccc6589c9-6bk8k) due to certificate revocation
2025-12-02T00:27:53.592305Z info tls::crl_watcher:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k} closed 1 connection(s) due to certificate revocation
2025-12-02T00:28:11.548205Z warn tls::crl:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k} certificate with serial 707454611961486600658165029106210051507929636288 is REVOKED in CRL 1 (issuer: C=US, ST=CA, L=San Francisco, O=Istio, OU=Test, CN=Intermediate CA)
2025-12-02T00:28:11.548256Z warn tls::crl:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k} revocation date: ASN1Time { time: 2025-12-02 0:27:28.0 +00:00:00, generalized: false }
2025-12-02T00:28:11.548261Z warn tls::crl:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k} intermediate CA certificate at position 0 is REVOKED
2025-12-02T00:28:11.548264Z error tls::workload:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k} Client certificate is REVOKED - rejecting connection
2025-12-02T00:28:11.548316Z warn access connection failed src.addr=10.244.0.8:37822 dst.addr=10.244.0.9:15008 direction="inbound" error="tls handshake error: Custom { kind: InvalidData, error: InvalidCertificate(Revoked) }"
2025-12-02T00:28:11.548350Z error proxy::h2::client:proxy{wl=crl-test/sleep-5f59799959-bkk6f}:outbound{id=7673656d76b37839f3dee3941872e530} Error in HBONE connection handshake: Error { kind: Io(Custom { kind: InvalidData, error: "received fatal alert: CertificateRevoked" }) }
2025-12-02T00:28:11.548462Z error access connection complete src.addr=10.244.0.8:33734 src.workload="sleep-5f59799959-bkk6f" src.namespace="crl-test" src.identity="spiffe://cluster.local/ns/crl-test/sa/sleep" dst.addr=10.244.0.9:15008 dst.hbone_addr=10.244.0.9:80 dst.service="httpbin.crl-test.svc.cluster.local" dst.workload="httpbin-5ccc6589c9-6bk8k" dst.namespace="crl-test" dst.identity="spiffe://cluster.local/ns/crl-test/sa/httpbin" direction="outbound" bytes_sent=0 bytes_recv=0 duration="2ms" error="h2 failed: received fatal alert: CertificateRevoked"Rejecting new connection - 2025-12-02T00:29:25.642613Z warn tls::crl:proxy{wl=crl-test-new/httpbin-new-7df9bf6664-v6lz4} certificate with serial 707454611961486600658165029106210051507929636288 is REVOKED in CRL 1 (issuer: C=US, ST=CA, L=San Francisco, O=Istio, OU=Test, CN=Intermediate CA)
2025-12-02T00:29:25.642669Z warn tls::crl:proxy{wl=crl-test-new/httpbin-new-7df9bf6664-v6lz4} revocation date: ASN1Time { time: 2025-12-02 0:27:28.0 +00:00:00, generalized: false }
2025-12-02T00:29:25.642680Z warn tls::crl:proxy{wl=crl-test-new/httpbin-new-7df9bf6664-v6lz4} intermediate CA certificate at position 0 is REVOKED
2025-12-02T00:29:25.642682Z error tls::workload:proxy{wl=crl-test-new/httpbin-new-7df9bf6664-v6lz4} Client certificate is REVOKED - rejecting connection
2025-12-02T00:29:25.642755Z warn access connection failed src.addr=10.244.0.10:35048 dst.addr=10.244.0.11:15008 direction="inbound" error="tls handshake error: Custom { kind: InvalidData, error: InvalidCertificate(Revoked) }"
2025-12-02T00:29:25.642808Z error proxy::h2::client:proxy{wl=crl-test-new/sleep-new-85f6c967d-q4sx8}:outbound{id=d8d4f2f2dc6a124d294ce9603e3893eb}Error in HBONE connection handshake: Error { kind: Io(Custom { kind: InvalidData, error: "received fatal alert: CertificateRevoked" }) } |
src/proxy/inbound.rs
Outdated
| // Register TLS connection for CRL monitoring | ||
| let dest_workload = match pi.local_workload_information.get_workload().await { | ||
| Ok(wl) => wl, | ||
| Err(e) => { | ||
| error!( | ||
| "failed to fetch local workload for TLS connection tracking: {}", | ||
| e | ||
| ); | ||
| return Err(proxy::Error::SelfCall); | ||
| } | ||
| }; | ||
|
|
||
| let rbac_ctx = ProxyRbacContext { | ||
| conn: conn.clone(), | ||
| dest_workload, | ||
| }; | ||
|
|
||
| let mut tls_guard = pi | ||
| .connection_manager | ||
| .register_tls_connection(&rbac_ctx, client_cert_serials.clone()); | ||
| let tls_drain = tls_guard.watcher(); | ||
| debug!("registered TLS connection for CRL monitoring"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a second connection guard at a different level? The guard created inside the request_handler seems sufficient. It'll already include the certificates involved in the connection and can be closed if the connection needs to terminate due to revoked certificates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not closing exiting connections properly. Worked for new connections so I had to add this. But let me take a look at it again and validate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilrudie, so this is mainly to keep track of idle connections. In my teseting I found that, if we remove TLS guard and rely only on per-request guards:
- for idle TLS connection (no active requests), no guards exist
- so when CRL update happens -> CrlWatcher finds no connections to close
- and, revoked certificate remains connected until next request
wdyt?
No TLS gurd works for new connections since it pretty straight forward check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a correctness issue? A long-lived connection would still be closed.
What remains is only an idle HBONE which can't be used by the application?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a correctness issue?
Maybe. My goal is to terminate ongoing connection as well. But open to only rejecting new connections since ongoing connections will eventually timeout.
Signed-off-by: nilekh <[email protected]>
Signed-off-by: nilekh <[email protected]>
Signed-off-by: nilekh <[email protected]>
Signed-off-by: nilekh <[email protected]>
8325fb5 to
5131791
Compare
Signed-off-by: nilekh <[email protected]>
5131791 to
fbabb2c
Compare
|
/test |
|
@nilekhc: The The following commands are available to trigger optional jobs: Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test all |
1 similar comment
|
/test all |
ilrudie
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we clarify the behaviors expected in this PR after the WG discussion. From a look through it seems like:
- net-new connections (no previous HBONE established) trying to use revoked certs: DENY
- new application connection (open HBONE) trying to use revoked certs: DENY
- open application connection (application actively sending traffic) presently using a revoked certs: CLOSE
- no application connection (open HBONE): nothing, HBONE connection lives until timeout
Is there any path to an application actually sending traffic with a revoked cert?
I bring this up because my understanding of the WG discussion is that we would move the entirety of this checking to the HBONE layer. New HBONE connections with revoked certs are not accepted, but existing ones continue to be functional. I worry there are edges, I tried to describe one in a comment tied to a line of code in the connection manager, which are tricky to foresee when mixing the HBONE and application properties.
| pub async fn assert_rbac( | ||
| &self, | ||
| state: &DemandProxyState, | ||
| ctx: &ProxyRbacContext, | ||
| dest_service: Option<String>, | ||
| client_cert_serials: Option<Vec<Vec<u8>>>, | ||
| ) -> Result<ConnectionGuard, Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: I'm not sure where to leave this. It's not a comment about this line of code per se. More a semi-relevant anchor for my thought process.
This effectively ties a property of the HBONE transport being used (certificate serials) to the application connection using it. It's not incorrect to know details about the underlying transport mechanism I think. It means we can deny application level traffic based on the transport which would be used for it.
The rub is, I think we can get caught in a bad state by doing this and not also invalidating the transport underneath. Even if we get updated certificates, the underlying transport is already established and would continue to be used causing denies.
This PR extends the work of CRL support by consuming the ca-crl config map in zTunnel and validates the request accordingly.
Fixes #1678