Skip to content

Conversation

@nilekhc
Copy link

@nilekhc nilekhc commented Oct 30, 2025

This PR extends the work of CRL support by consuming the ca-crl config map in zTunnel and validates the request accordingly.

Fixes #1678

@istio-testing
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Oct 30, 2025
@istio-policy-bot
Copy link

😊 Welcome @nilekhc! This is either your first contribution to the Istio ztunnel repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@istio-testing istio-testing added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 30, 2025
@nilekhc nilekhc changed the title feat: implements ca-crl in zTunnel feat: implements crl support in zTunnel Oct 30, 2025
@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/test all

@howardjohn
Copy link
Member

I'm a bit confused. It seems there was a design doc that was approved for this (not sure by who, as I was out of office and there are no meeting notes or listed approvers) but the design is completely different than this

@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from 6cfcad3 to c522f5f Compare October 30, 2025 15:33
@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/test all

@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

I'm a bit confused. It seems there was a design doc that was approved for this (not sure by who, as I was out of office and there are no meeting notes or listed approvers) but the design is completely different than this

Hi @howardjohn, Thanks for the comment. The original design I presented in one of the community meetings received some valuable feedback. Based on that, I pivoted the design to propagate CRL information through a ConfigMap, similar to how the ca-root-cert is propogated. I’ve captured these updates in the same design document, under the Implementation section.

This PR simply uses that CRL ConfigMap and validates the request accordingly.

cc @costinm @therealmitchconnors @keithmattix

@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from c522f5f to ce5886a Compare October 30, 2025 16:20
@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/test all

@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from ce5886a to 2402c4d Compare October 30, 2025 16:34
@istio-testing istio-testing added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 30, 2025
@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/test all

@nilekhc nilekhc marked this pull request as ready for review October 30, 2025 16:43
@nilekhc nilekhc requested a review from a team as a code owner October 30, 2025 16:43
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Oct 30, 2025
socket2::SockRef::from(&s).set_tcp_keepalive(&ka)
);
}
#[cfg(target_os = "linux")]
Copy link
Author

@nilekhc nilekhc Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting an error on macos due to cross compilation. I have to add this to some places to fix it. Please let me know if there is a better way to handle this.

@nilekhc
Copy link
Author

nilekhc commented Oct 30, 2025

/retest


/// Close all inbound and outbound connections
/// This is called when certificate revocations are detected
pub fn close_all_connections(&self) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems really dangerous and bad behavior. How strongly do you feel about keeping this?

Does any other application do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I don't think we should be closing every single connection just because of one revocation. What does Envoy do?

Copy link
Author

@nilekhc nilekhc Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I addd this since http connections are long lived and does not check updated crl. Is there a better way to handle this? I know envoy has some timeout we can configure. Not comepletly sure how zTunnel handles it. Happy to tweak this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intuition is to do nothing. But at the very least, only close ones that are revoked

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing nothing feels wrong IMO; the purpose of a CRL is to be able to break a particular trust chain, and you don't want existing connections to continue. Only closing the ones that are revoked is the right solution to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is restart your clients if you want the CRL to take effect quickly the UX we want then?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo, zTunnel should terminate that connection selectively and avoid restarting the workload. Since the user provides/updates the CRL file, they’re already aware of potentially revoked certificates, so why ask them to take action twice?
(1) Create or update the CRL, and
(2) Restart the workload pod.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Playing advocate aside, I suspect that most folks would accept a limitation where existing connections are not re-evaluated on a CRL refresh. This seems to be a fairly common behavior. I think I would lean towards breaking support for re-evaluating existing connections into a separate PR (assuming we do want to explore that functionality).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate PR seems reasonable to me so we can evaluate the complexity on its own

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted the outbound related changes and now it only selectively terminates inbound connections. PTAL. Thanks

/cc @howardjohn @keithmattix @ilrudie

@therealmitchconnors
Copy link
Contributor

I'm a bit confused. It seems there was a design doc that was approved for this (not sure by who, as I was out of office and there are no meeting notes or listed approvers) but the design is completely different than this

If you'd like to see the conversation where this was reviewed, it's in the 4/14 wg meeting at https://www.youtube.com/watch?v=welIiuqUvn0&t=118s

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Nov 18, 2025
@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from 5509cf6 to 37f0ca0 Compare November 19, 2025 01:11
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Nov 19, 2025
@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch 2 times, most recently from 11a8b0b to 0f855bf Compare November 19, 2025 19:15
@nilekhc
Copy link
Author

nilekhc commented Nov 20, 2025

@howardjohn @ilrudie @keithmattix Could you PTAL when you get a chance? Thanks.

Copy link
Contributor

@ilrudie ilrudie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me this bundles functionality that belongs in something like PolicyWatcher into the ConnectionManager. Please take a look at that, specifically how run works and is invoked and works for an example of how similar functionality was implemented.

Comment on lines 198 to 208
let rbac_ctx = ProxyRbacContext {
conn: conn.clone(),
dest_workload: destination_workload,
};
let mut tls_guard = pi
.connection_manager
.register_tls_connection(&rbac_ctx, client_cert_serials.clone());
debug!("TLS connection registered for revocation tracking");

// Get watcher BEFORE serving to listen for revocation-based termination
let tls_drain_watch = tls_guard.watcher();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serv_connect will already create a connection guard which will end the connection when asked. I don't think we need a second connection guard

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a commit didn't get pushed. We still have a second connection guard being created

Comment on lines 239 to 255
// This allows CRL revocation to terminate the connection immediately
tokio::select! {
result = serve => {
tls_guard.release();
result
}
_ = tls_drain_watch.wait_for_drain() => {
debug!("TLS connection terminated due to certificate revocation");
tls_guard.release();
// Return an error to signal abnormal termination
// This helps with proper cleanup and error reporting
Err(std::io::Error::new(
std::io::ErrorKind::ConnectionAborted,
"connection terminated due to certificate revocation"
).into())
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to above, we already take a guard that closes the connection, I don't think we need

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored

@nilekhc nilekhc requested review from howardjohn and ilrudie December 1, 2025 23:58
@nilekhc
Copy link
Author

nilekhc commented Dec 2, 2025

It seems to me this bundles functionality that belongs in something like PolicyWatcher into the ConnectionManager. Please take a look at that, specifically how run works and is invoked and works for an example of how similar functionality was implemented.

Thanks for the pointers @ilrudie. I refactored the code with CRLWatcher much like PolicyWatcher. PTAL Thanks.

@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from 9d7986a to 2071f5f Compare December 2, 2025 00:30
@nilekhc
Copy link
Author

nilekhc commented Dec 2, 2025

Local testing -

Rejecting exisiting connection:

2025-12-02T00:27:53.591237Z	warn	tls::crl	detected 1 NEW certificate revocation(s)
2025-12-02T00:27:53.591338Z	warn	tls::crl	  newly revoked serial: [123, 235, 92, 169, 143, 75, 240, 223, 161, 240, 41, 122, 127, 234, 128, 208, 103, 54, 93, 192]
2025-12-02T00:27:53.591387Z	info	tls::crl	NEW REVOCATIONS DETECTED - CRL watcher will handle connection closures
2025-12-02T00:27:53.591552Z	info	proxy::h2::server:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k}	starting graceful drain (TLS connection certificate revoked)
2025-12-02T00:27:53.592162Z	info	tls::crl_watcher:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k}	closed connection 10.244.0.8:46424(spiffe://cluster.local/ns/crl-test/sa/sleep)->10.244.0.9:15008 (Kubernetes//Pod/crl-test/httpbin-5ccc6589c9-6bk8k) due to certificate revocation
2025-12-02T00:27:53.592305Z	info	tls::crl_watcher:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k}	closed 1 connection(s) due to certificate revocation
2025-12-02T00:28:11.548205Z	warn	tls::crl:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k}	certificate with serial 707454611961486600658165029106210051507929636288 is REVOKED in CRL 1 (issuer: C=US, ST=CA, L=San Francisco, O=Istio, OU=Test, CN=Intermediate CA)
2025-12-02T00:28:11.548256Z	warn	tls::crl:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k}	revocation date: ASN1Time { time: 2025-12-02 0:27:28.0 +00:00:00, generalized: false }
2025-12-02T00:28:11.548261Z	warn	tls::crl:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k}	intermediate CA certificate at position 0 is REVOKED
2025-12-02T00:28:11.548264Z	error	tls::workload:proxy{wl=crl-test/httpbin-5ccc6589c9-6bk8k}	Client certificate is REVOKED - rejecting connection
2025-12-02T00:28:11.548316Z	warn	access	connection failed	src.addr=10.244.0.8:37822 dst.addr=10.244.0.9:15008 direction="inbound" error="tls handshake error: Custom { kind: InvalidData, error: InvalidCertificate(Revoked) }"
2025-12-02T00:28:11.548350Z	error	proxy::h2::client:proxy{wl=crl-test/sleep-5f59799959-bkk6f}:outbound{id=7673656d76b37839f3dee3941872e530}	Error in HBONE connection handshake: Error { kind: Io(Custom { kind: InvalidData, error: "received fatal alert: CertificateRevoked" }) }
2025-12-02T00:28:11.548462Z	error	access	connection complete	src.addr=10.244.0.8:33734 src.workload="sleep-5f59799959-bkk6f" src.namespace="crl-test" src.identity="spiffe://cluster.local/ns/crl-test/sa/sleep" dst.addr=10.244.0.9:15008 dst.hbone_addr=10.244.0.9:80 dst.service="httpbin.crl-test.svc.cluster.local" dst.workload="httpbin-5ccc6589c9-6bk8k" dst.namespace="crl-test" dst.identity="spiffe://cluster.local/ns/crl-test/sa/httpbin" direction="outbound" bytes_sent=0 bytes_recv=0 duration="2ms" error="h2 failed: received fatal alert: CertificateRevoked"

Rejecting new connection -

2025-12-02T00:29:25.642613Z	warn	tls::crl:proxy{wl=crl-test-new/httpbin-new-7df9bf6664-v6lz4}	certificate with serial 707454611961486600658165029106210051507929636288 is REVOKED in CRL 1 (issuer: C=US, ST=CA, L=San Francisco, O=Istio, OU=Test, CN=Intermediate CA)
2025-12-02T00:29:25.642669Z	warn	tls::crl:proxy{wl=crl-test-new/httpbin-new-7df9bf6664-v6lz4}	revocation date: ASN1Time { time: 2025-12-02 0:27:28.0 +00:00:00, generalized: false }
2025-12-02T00:29:25.642680Z	warn	tls::crl:proxy{wl=crl-test-new/httpbin-new-7df9bf6664-v6lz4}	intermediate CA certificate at position 0 is REVOKED
2025-12-02T00:29:25.642682Z	error	tls::workload:proxy{wl=crl-test-new/httpbin-new-7df9bf6664-v6lz4}	Client certificate is REVOKED - rejecting connection
2025-12-02T00:29:25.642755Z	warn	access	connection failed	src.addr=10.244.0.10:35048 dst.addr=10.244.0.11:15008 direction="inbound" error="tls handshake error: Custom { kind: InvalidData, error: InvalidCertificate(Revoked) }"
2025-12-02T00:29:25.642808Z	error	proxy::h2::client:proxy{wl=crl-test-new/sleep-new-85f6c967d-q4sx8}:outbound{id=d8d4f2f2dc6a124d294ce9603e3893eb}Error in HBONE connection handshake: Error { kind: Io(Custom { kind: InvalidData, error: "received fatal alert: CertificateRevoked" }) }

Comment on lines 158 to 179
// Register TLS connection for CRL monitoring
let dest_workload = match pi.local_workload_information.get_workload().await {
Ok(wl) => wl,
Err(e) => {
error!(
"failed to fetch local workload for TLS connection tracking: {}",
e
);
return Err(proxy::Error::SelfCall);
}
};

let rbac_ctx = ProxyRbacContext {
conn: conn.clone(),
dest_workload,
};

let mut tls_guard = pi
.connection_manager
.register_tls_connection(&rbac_ctx, client_cert_serials.clone());
let tls_drain = tls_guard.watcher();
debug!("registered TLS connection for CRL monitoring");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a second connection guard at a different level? The guard created inside the request_handler seems sufficient. It'll already include the certificates involved in the connection and can be closed if the connection needs to terminate due to revoked certificates.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not closing exiting connections properly. Worked for new connections so I had to add this. But let me take a look at it again and validate.

Copy link
Author

@nilekhc nilekhc Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilrudie, so this is mainly to keep track of idle connections. In my teseting I found that, if we remove TLS guard and rely only on per-request guards:

  • for idle TLS connection (no active requests), no guards exist
  • so when CRL update happens -> CrlWatcher finds no connections to close
  • and, revoked certificate remains connected until next request

wdyt?

No TLS gurd works for new connections since it pretty straight forward check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a correctness issue? A long-lived connection would still be closed.

What remains is only an idle HBONE which can't be used by the application?

Copy link
Author

@nilekhc nilekhc Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a correctness issue?

Maybe. My goal is to terminate ongoing connection as well. But open to only rejecting new connections since ongoing connections will eventually timeout.

@nilekhc nilekhc requested a review from ilrudie December 2, 2025 22:12
@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from 8325fb5 to 5131791 Compare December 3, 2025 22:55
@istio-testing istio-testing added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 3, 2025
@nilekhc nilekhc force-pushed the nilekh/f/crl-support branch from 5131791 to fbabb2c Compare December 3, 2025 23:02
@nilekhc
Copy link
Author

nilekhc commented Dec 3, 2025

/test

@istio-testing
Copy link
Contributor

@nilekhc: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test test

The following commands are available to trigger optional jobs:

/test bench

Use /test all to run the following jobs that were automatically triggered:

test_ztunnel

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nilekhc
Copy link
Author

nilekhc commented Dec 3, 2025

/test all

1 similar comment
@nilekhc
Copy link
Author

nilekhc commented Dec 3, 2025

/test all

Copy link
Contributor

@ilrudie ilrudie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we clarify the behaviors expected in this PR after the WG discussion. From a look through it seems like:

  • net-new connections (no previous HBONE established) trying to use revoked certs: DENY
  • new application connection (open HBONE) trying to use revoked certs: DENY
  • open application connection (application actively sending traffic) presently using a revoked certs: CLOSE
  • no application connection (open HBONE): nothing, HBONE connection lives until timeout

Is there any path to an application actually sending traffic with a revoked cert?

I bring this up because my understanding of the WG discussion is that we would move the entirety of this checking to the HBONE layer. New HBONE connections with revoked certs are not accepted, but existing ones continue to be functional. I worry there are edges, I tried to describe one in a comment tied to a line of code in the connection manager, which are tricky to foresee when mixing the HBONE and application properties.

Comment on lines 185 to 191
pub async fn assert_rbac(
&self,
state: &DemandProxyState,
ctx: &ProxyRbacContext,
dest_service: Option<String>,
client_cert_serials: Option<Vec<Vec<u8>>>,
) -> Result<ConnectionGuard, Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I'm not sure where to leave this. It's not a comment about this line of code per se. More a semi-relevant anchor for my thought process.

This effectively ties a property of the HBONE transport being used (certificate serials) to the application connection using it. It's not incorrect to know details about the underlying transport mechanism I think. It means we can deny application level traffic based on the transport which would be used for it.

The rub is, I think we can get caught in a bad state by doing this and not also invalidating the transport underneath. Even if we get updated certificates, the underlying transport is already established and would continue to be used causing denies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add peer CA CRL support in zTunnel

7 participants