
Conversation

agilgur5
Contributor

Summary

Add a note in 1.22 changelog about #11921 et al's CA changes

Details

  • This broke some of my teams' automation code that relied on the CN, so I thought it would be good to call it out so that anyone else who stumbles upon this doesn't spend a few hours debugging 😅
    • This change may particularly impact Prod environments where the cluster has not been rebuilt in some time (years): such clusters will have the old CN, while newer clusters in lower environments will have the new CN
      • So code that relies on the CN may unexpectedly break Production while working fine in lower environments 😬 😬 😬
        • Fortunately we caught this in our QA env, but it passed Dev fine

Testing Evidence

From an internal fix my team made to another team's automation:
[Screenshot: "k8s ca change" — diff of the internal fix]

(We've recommended they pull the CA from /etc/kubernetes/pki/ca.crt or /var/run/secrets/kubernetes.io/serviceaccount/ca.crt to be a bit more resilient to changes like this.)
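
For illustration, here is a minimal Go sketch (not the actual fix, just the recommended pattern) that reads the cluster CA from the in-cluster service account mount instead of inferring it from certificate CNs:

```go
// Minimal sketch: read the cluster CA from the standard in-cluster path,
// rather than inspecting a TLS handshake and matching on the CN.
package main

import (
	"fmt"
	"os"
)

// Standard in-cluster location of the cluster CA bundle.
const saCAPath = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"

func main() {
	caPEM, err := os.ReadFile(saCAPath)
	if err != nil {
		fmt.Fprintf(os.Stderr, "failed to read cluster CA: %v\n", err)
		os.Exit(1)
	}
	// caPEM can now be handed to whatever needs the cluster CA
	// (e.g. a Vault Kubernetes auth configuration) without ever
	// depending on the certificate's CN.
	fmt.Printf("read %d bytes of CA PEM\n", len(caPEM))
}
```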

Review Notes

Feel free to change the language / wording as you see fit, or put this in a different part of the release notes (I put it under "Other changes of note" for now).

I thought specifically warning about older clusters was important, as otherwise this is particularly easy to overlook.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 16, 2023
@k8s-ci-robot
Contributor

Hi @agilgur5. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@agilgur5
Contributor Author

@johngmyers any chance you could take a look at this PR?

@johngmyers
Member

I must admit to being at a loss as to why anything would depend on the CN having a particular value.

@agilgur5
Contributor Author

agilgur5 commented Jun 25, 2023

The automation I mentioned here was for the Vault k8s integration, which requires the k8s CA.

You can see in the screenshot above that this code does TLS inspection to pull the CA, but ends up getting two certs: the CA and the kubernetes-master cert. It uses the CN to distinguish between the two.
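
To make the pattern concrete, here is a rough sketch (hypothetical names and endpoint, not the team's actual code) of that kind of CN-based filtering:

```go
// Rough sketch of the brittle pattern described above: dial the API server,
// inspect the presented chain, and pick out "the CA" by matching a hard-coded
// CN. The hard-coded CN is exactly what breaks when the CA's CN changes.
package main

import (
	"crypto/tls"
	"fmt"
	"os"
)

const (
	apiServer    = "api.example.cluster:443" // hypothetical API server endpoint
	expectedCACN = "old-ca-common-name"      // placeholder; the real value depends on the kOps version
)

func main() {
	// InsecureSkipVerify: the goal here is only to inspect the presented certs.
	conn, err := tls.Dial("tcp", apiServer, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		fmt.Fprintf(os.Stderr, "dial failed: %v\n", err)
		os.Exit(1)
	}
	defer conn.Close()

	// The server presents both its serving cert (e.g. kubernetes-master) and
	// the CA; the CN is used to tell them apart, which silently fails once
	// the CA's CN no longer matches expectedCACN.
	for _, cert := range conn.ConnectionState().PeerCertificates {
		if cert.Subject.CommonName == expectedCACN {
			fmt.Println("found CA by CN:", cert.Subject.CommonName)
			return
		}
	}
	fmt.Fprintln(os.Stderr, "no certificate matched the expected CA CN")
	os.Exit(1)
}
```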

I don't think that's the best way to get the CA, as I mentioned in the opening, but regardless, a team I work with already had code that specifically depended on the CN. In other words, the CN was part of the "public surface" that kOps exposes, and changing the CN broke existing code. Whether or not that code is optimal, and regardless of why it exists, the fact that any code anywhere relies on the CN is enough to show this.

As is oft stated in Linux kernel development, anything that is exposed will be used (aka the "Workflow" xkcd).

So at the very least, I think this should be documented in the changelog, which is all that this PR does.
Per the opening, I think it is doubly important to document since this has a higher likelihood of breaking code that depends on the CN in Prod, as those clusters are often older / rebuilt less often and so may still have the old CN. As noted above, code that worked in Dev may actually fail in Prod as a result, with potentially very significant consequences (in the case here, several Vault integrations would fail, eventually resulting in downtime).

@agilgur5
Contributor Author

agilgur5 commented Jul 6, 2023

@johngmyers can you take a look again?

@johngmyers
Member

I don't think this rises to the level of a release note. This is an implementation detail; the CN of a root CA is not something that is supposed to matter to anything.

@agilgur5
Contributor Author

agilgur5 commented Jul 7, 2023

I don't think this rises to the level of a release note.

It's a breaking change that broke actual code in production exactly because it wasn't mentioned in the release notes...

A reasonable user could suggest much more than a tiny release note when a change broke prod. This is the bare minimum, in my opinion.

the CN of a root CA is not something that is supposed to matter to anything.

It did. This is not a hypothetical...
Only one counter-example is needed to show that.

@johngmyers
Member

kOps 1.22 has been out of support for about a year. One person making an incorrect assumption about internals does not merit a release note.

@agilgur5
Contributor Author

agilgur5 commented Jul 8, 2023

kOps 1.22 has been out of support for about a year.

I'll repeat once more that this breaking change affects clusters that upgraded to 1.22 and beyond and were not rebuilt in that time, which is, again, much more likely for Prod clusters that have zero downtime (i.e. high criticality).
This affects all clusters that existed pre-1.22 and were upgraded to 1.22+ through the standard kOps upgrade process.

A cluster that was upgraded from 1.21 to 1.26 (the current release) would also have this issue. As written above, that cluster, now on 1.26, would have the old CN, while clusters created with 1.22+ and similarly on 1.26 would have the newer CN. Two clusters on 1.26 could have two different CNs, despite having the same configuration.

The CA has a different CN depending on what version of kOps the cluster was originally created with (not the version it's on).
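
For anyone who wants to check which CN their cluster's CA actually carries, here is a small sketch (assuming access to the CA file, e.g. /etc/kubernetes/pki/ca.crt on a control-plane node):

```go
// Small sketch: print the Subject CN of a cluster's CA certificate.
// Two clusters on the same kOps version can print different values,
// depending on which kOps version originally created them.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

func main() {
	data, err := os.ReadFile("/etc/kubernetes/pki/ca.crt") // adjust path for your environment
	if err != nil {
		fmt.Fprintf(os.Stderr, "read CA: %v\n", err)
		os.Exit(1)
	}
	block, _ := pem.Decode(data)
	if block == nil {
		fmt.Fprintln(os.Stderr, "no PEM block found in CA file")
		os.Exit(1)
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		fmt.Fprintf(os.Stderr, "parse CA: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("CA CommonName:", cert.Subject.CommonName)
}
```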

One person making an incorrect assumption about internals

I'm not the one who made that assumption (and it was, in fact, code reviewed by a whole team that has used kOps for years), and I, myself, literally said there are better ways of implementing their code.
But I also would not have been able to definitively tell them not to rely on that; it is, similarly, undocumented behavior.

It is also a CA, and CAs are not really meant to change frequently either. Whether a CA should be considered an "internal" is debatable.

@agilgur5
Contributor Author

agilgur5 commented Jul 8, 2023

One person making an incorrect assumption about internals

the CN of a root CA is not something that is supposed to matter to anything.

One could say that this, too, is an assumption. And there is an existing counter-example showing that it did, in fact, matter to something... so one could further say that that is "an incorrect assumption" 🤷

I'm not here to play word games and state opinions; I am stating facts about events that have literally already happened.
I am here to warn other users about that, because if we had not caught this in QA, it would have resulted in a major issue at a large enterprise company. I would not wish production outages upon anybody, nor hours of debugging and root cause analysis (which required reading through kOps source code and individual commit diffs), and the absolute least I can do is add a tiny warning to the release notes.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2024
@agilgur5
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 19, 2024
@agilgur5
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2024
@agilgur5
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 16, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2025
@agilgur5
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2025
@agilgur5
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jul 4, 2025
@hakman
Member

hakman commented Jul 4, 2025

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 4, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hakman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 4, 2025
@hakman hakman added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 4, 2025
@hakman
Member

hakman commented Jul 4, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 4, 2025
@hakman hakman removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jul 4, 2025
@hakman hakman changed the title docs(release): add note to 1.22 about the CA CN rename docs: Add note to 1.22 about the CA CN rename Jul 4, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jul 4, 2025
@k8s-ci-robot
Contributor

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hakman hakman merged commit 399d020 into kubernetes:master Jul 4, 2025
11 of 25 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Jul 4, 2025
@agilgur5 agilgur5 deleted the 1.22-release-ca-change branch July 4, 2025 21:44