Skip to content

Conversation

laverya
Copy link
Member

@laverya laverya commented Jan 29, 2025

What this PR does / why we need it:

Which issue(s) this PR fixes:

Does this PR require a test?

Does this PR require a release note?


Does this PR require documentation?

@github-actions
Copy link

github-actions bot commented Jan 29, 2025

This PR has been released (on staging) and is available for download with a embedded-cluster-smoke-test-staging-app license ID.

Online Installer:

curl "https://staging.replicated.app/embedded/embedded-cluster-smoke-test-staging-app/ci/appver-dev-7d574c1" -H "Authorization: $EC_SMOKE_TEST_LICENSE_ID" -o embedded-cluster-smoke-test-staging-app-ci.tgz

Airgap Installer (may take a few minutes before the airgap bundle is built):

curl "https://staging.replicated.app/embedded/embedded-cluster-smoke-test-staging-app/ci-airgap/appver-dev-7d574c1?airgap=true" -H "Authorization: $EC_SMOKE_TEST_LICENSE_ID" -o embedded-cluster-smoke-test-staging-app-ci.tgz

Happy debugging!

Copy link
Member

@adamancini adamancini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

adamancini
adamancini previously approved these changes Jan 29, 2025
@ajp-io
Copy link
Member

ajp-io commented Jan 29, 2025

Why does the limit need to be this high? Can EC necessarily not work unless the limit is this high? When I get a GCP instance, the limit is 128. I'm hesitant to add preflights that will fail unless they're absolutely necessary.

@ajp-io
Copy link
Member

ajp-io commented Jan 29, 2025

If it is absolutely necessary, we could also consider setting this sysctl like we do for ip forwarding, and then the preflight only fails if we can't set it ourselves at the beginning. That way we don't introduce a lot of friction with failing preflights.

@adamancini
Copy link
Member

@ajp-io the number can be up to 2^31, based on how much RAM is available to the system - a more precise estimate probably depends heavily on the # of pods that you deploy and what they're doing. 8192 is probably high enough to cover the majority of clusters, but this is kind of a "you have to tune this" parameter in the kernel. I have run into problems during upgrades in my EC cluster on GCP with the default 128 in Ubuntu.

I often don't hit this on a fresh cluster but only after application pods start to get deployed.

I imagine this event had helm charts installed as part of EC, so a greater than averge number of pods are started at first installation

@adamancini
Copy link
Member

@ajp-io I do think we should probably put it into a drop-in - there are some cases where sysctl dropins get overridden by end user configs, like via Puppet or Ansible - so I support a preflight, supportbundle analyzer, and we should install a sysctl drop-in config.

@jtuchscherer
Copy link

Let's hold off on merging and talk through the implications of adding this pre-flight check

@laverya
Copy link
Member Author

laverya commented Jan 29, 2025

Why does the limit need to be this high? Can EC necessarily not work unless the limit is this high? When I get a GCP instance, the limit is 128. I'm hesitant to add preflights that will fail unless they're absolutely necessary.

I was just on a support case where 128 was preventing k0s from functioning properly - that's what prompted this

@ajp-io
Copy link
Member

ajp-io commented Jan 30, 2025

a more precise estimate probably depends heavily on the # of pods that you deploy and what they're doing

If this is customer-specific, would it be better to support vendor-supplied host preflight checks and have ITRS and anyone else who needs it set this themselves? ITRS seems like they have more charts (which I image translates to more pods) than many/most vendors. That makes me think many people won't be affected by this issue but will potentially face unneeded preflight failures anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants