
Conversation

emilazy
Member

@emilazy emilazy commented Jul 20, 2024

Motivation

This is a rework of #6855. It isn’t a perfect solution (e.g. if you’re building a bunch of Rust stuff, those builds will still use as many cores as they want), and it’d be great to have an implementation of the new and improved GNU Make job server protocol at some point, but there seems to be widespread desire for this to come back in some form, as it provides a much better UX for workstation users doing large builds, especially when you don’t have copious amounts of spare RAM. I have a draft Nixpkgs stdenv change to use this variable that I will post and link here.

I would like to backport this down to 2.18 so that NixOS and nix-darwin users get a good experience out of the box. I can handle the manual backport work for any versions the bot fails on.

Context

Some notes and questions:

  1. I kept the default at getDefaultCores() rather than using the cores setting. That seems like a reasonable middle ground: system load is a bit of a laggy, imprecise measure, so using cores directly would likely make CPU go to waste by default, but most workstation users probably want to keep total CPU utilization below 100% so they can still use their machine. This way, we don’t regress on the concerns laid out in stdenv/setup.sh: fix parallel make nixpkgs#174473 and treewide: drop -l$NIX_BUILD_CORES nixpkgs#192447. It’d also be kind of a fuss code‐wise to make the default depend on another setting like that. I’m open to bikeshedding on this, though; none and the value of cores also seem like plausible defaults, although I think the former would be optimizing too much for Hydra over the UX for people with normal machines that aren’t exclusively used for build farms. Given that the default for cores is to use all the cores on the machine, and that limiting cores doesn’t limit total system load at all right now, I think this is probably fine.

  2. I’m a bit unsure about how getDefaultCores() interacts with remote builders, and the difference between having cores unset and setting cores = 0. If you have remote builders and the default cores, does each remote builder limit to the number of cores on the machine doing the evaluation? Surely not, right? The Nixpkgs stdenv explicitly handles cores = 0, normalizing it to the result of $(nproc). Is this just a backwards‐compatibility hack and there’s no reason for cores = 0 these days? I want to understand if there’s any reason for both cores = ‹default› and cores = 0 to exist, so that I can understand if it makes sense to mirror this behaviour with load-limit = 0 as my current changes do.

  3. GNU Make and Ninja have inconsistent interpretations of 0 and -1: with GNU Make, -l0 means something you’d never want and -l-1 disables the limit, while with Ninja, -l0 disables the limit and I think -l-1 might be invalid. I decided that it would be better to have an explicit non‐integer value to disable the load limit entirely, and to make load-limit = 0 the same as cores = 0 (see the sketch after this list), but obviously the answer to question 2 is a load‐bearing consideration here.

  4. Relatedly, is load-limit = none okay? Nothing uses that syntax currently. It seemed less obscure than using the empty string.

  5. I don’t understand what the src/libstore/daemon.cc thing is doing. I copied it from libstore+nix-build: add load-limit setting and use its value for NIX_LOAD_LIMIT env var #8105. Please tell me what it does and if it’s good or bad!
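
To make the semantics I have in mind concrete, here is a minimal standalone sketch (not the actual patch; `parseLoadLimit` and the use of `std::thread::hardware_concurrency()` are illustrative stand-ins, not Nix's real API) of how the proposed values could be interpreted: `none` disables the limit entirely, `0` mirrors `cores = 0` by resolving to the machine's core count, and any positive integer is used as-is.

```cpp
#include <optional>
#include <string>
#include <thread>

// Illustrative parser for the proposed load-limit values.
// Returns std::nullopt when the limit is disabled, so callers can
// skip exporting NIX_LOAD_LIMIT altogether in that case.
std::optional<unsigned> parseLoadLimit(const std::string & s)
{
    if (s == "none")
        return std::nullopt;                      // explicit "no limit"
    unsigned n = std::stoul(s);                   // throws on malformed input
    if (n == 0)
        n = std::thread::hardware_concurrency();  // mirror cores = 0 behaviour
    return n;
}
```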

Closes: #7091
Closes: #6855
Closes: #8105

cc @fricklerhandwerk @centromere @aviallon @reckenrode @ElvishJerricco

Priorities and Process

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@github-actions github-actions bot added the documentation and with-tests labels Jul 20, 2024
@emilazy
Member Author

emilazy commented Jul 20, 2024

Untested Nixpkgs PR to show the other side of the interface: NixOS/nixpkgs#328677.

@emilazy
Member Author

emilazy commented Jul 20, 2024

Oops, looks like some of the dead code I cleaned up before pushing was actually load‐bearing, thanks C++ 🙃 Working on it…

@emilazy emilazy force-pushed the push-tyzwwmvlwvwo branch from cdadc73 to 58e5be9 on July 20, 2024 15:36
emilazy added a commit to emilazy/nixpkgs that referenced this pull request Jul 20, 2024
@emilazy emilazy force-pushed the push-tyzwwmvlwvwo branch from 58e5be9 to 34f2477 on July 20, 2024 16:02
@emilazy
Member Author

emilazy commented Jul 20, 2024

I don’t understand why the test is failing only on Linux. Do I need to do more for cache‐busting?

@aviallon

Thanks for the work. I can't wait to drop my local patches.
It's always a pain.

# Release X.Y (202?-??-??)

- Add a `load-limit` setting to control builder parallelism. This has
also been backported to the 2.18 and later release branches.
Member

Release notes should now be done as separate files in doc/manual/rl-next/, they'll be concatenated when we do a release.
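
For reference, a new-style release note is just a small Markdown file with YAML front matter; something along these lines (the file name and wording are only a guess at what this PR's note would look like) would go in `doc/manual/rl-next/load-limit.md`:

```markdown
---
synopsis: Add a `load-limit` setting to control builder parallelism
prs: [11143]
---

Builders now receive a `NIX_LOAD_LIMIT` environment variable derived from
the new `load-limit` setting, which build systems can pass to `make -l`
and similar flags to cap total system load.
```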

@tomberek
Contributor

Needs a decision to either merge or close. How is the experience for those who have been carrying this patch?

@emilazy
Member Author

emilazy commented Aug 17, 2024

The patch doesn’t really do anything currently, because it depends on the linked changes to the Nixpkgs standard environment, which depend on this PR being accepted. (Perhaps @aviallon rebuilds the world locally to be able to use this, or just hacks it in for specific problem builds?)

However, it allows people to regain the behaviour from before the unconditional load limit in stdenv was removed a few years ago for the benefit of Hydra. Many people I have talked to found the old behaviour far preferable on workstation machines when doing large‐scale builds, which is why this is now the third pull request trying to work with Nix to add proper support for this feature. I have not yet subjected NixOS/nixpkgs#328677 to intense scrutiny from Nixpkgs reviewers because I don’t yet know whether this PR will go anywhere, but from what I can tell from talking about it in the development channels and reading old PRs and issues, nobody objects to the basic idea and a good number of people are eager to see this functionality.

This isn’t ready to merge because I need to address the review comment but also because I’m waiting on responses to all my questions in the PR message and #11143 (comment) from people who understand the Nix codebase better than me. If progress is made on those, I’ll update this PR accordingly and seek wider review of my stdenv changes.

@tomberek tomberek added the idea approved label Aug 26, 2024
@tomberek
Contributor

Idea approved. This is ready for wider review and testing against Nixpkgs and larger workloads. The small documentation changes requested above and the test fixups still need to be done. Assigning @edolstra for review.

@aviallon

aviallon commented Oct 1, 2024

Sorry for the spam, but is there any reason why this got stale?
If needed, I can do some testing.

To answer @emilazy: what I did was hack this feature into some specific problematic packages (hello, anything using LTO) by modifying makeFlags or similar to use -l ${NIX_LOAD_LIMIT}.

@emilazy
Member Author

emilazy commented Oct 1, 2024

I don’t know why this has stalled. I’ve been waiting for my questions from the PR message to be answered and for general review from @edolstra. Testing would be great, but parts of the approach may presumably have to change depending on the review and the answers to the things I’m uncertain about. Review of how the approach works out on the Nixpkgs side in NixOS/nixpkgs#328677 would also be good.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/limit-cpu-usage-when-building/61099/5

@DavHau
Member

DavHau commented Jun 19, 2025

Fixing this issue would be quite important.
As of now, nix is fundamentally incompatible with fast computers.

On anything with more than 16 cores, nix has to be tamed to not use all of the cores, or an insane investment in memory is required.
For a build tool this is a very unfortunate situation.

It would be great if the nix team could focus on moving this forward.

@emilazy
Member Author

emilazy commented Jun 19, 2025

From my side of things I'm happy to do the work to rebase this and address review feedback if it'll help drive it forward! I'm confident we can land support for this in Nixpkgs once we know what the interface will look like. I'm just waiting to hear back about the questions I raised in the PR message.

@aviallon

aviallon commented Jun 20, 2025

@emilazy perhaps we need another reviewer with more time on their hands? I suggest you rebase the PR to make it look alive, and have the CI do its thing.

@tomberek
Contributor

Idea has been approved. I'm not aware of any objections. Logs for the Linux failure are no longer available, but it seems like something that can be addressed.

@emilazy
Member Author

emilazy commented Jun 22, 2025

I’m happy to do the rebase and look into the test failure, but as I said in #11143 (comment) it’s hard to move forward with this without answers to the questions in #11143 (comment) from someone who is more of an expert in the codebase than me, so I’m hesitant to put the time into rebasing it without knowing which parts of the code/interface will need changing :)

@tomberek
Contributor

  1. Remote builds get their core numbers locally or from a machines file.
  2. Make takes a "-l" with no number as removing any setting. These are unfortunate conventions. So "0" means "auto" and "none" means disabled? Not sure what would be most consistent between leaving the setting unset and a value that means unset.
  3. It does look awkward. It would make someone think it can be used for "-j" as well. Feels inconsistent.
  4. @Ericson2314 this is related to setting refactoring?

@tomberek
Contributor

tomberek commented Jul 2, 2025

There is some interaction with: #13402

env["NIX_BUILD_TOP"] = env["TMPDIR"] = env["TEMPDIR"] = env["TMP"] = env["TEMP"] = tmp;
env["NIX_STORE"] = store->storeDir;
env["NIX_BUILD_CORES"] = std::to_string(settings.buildCores);
if (settings.loadLimit.get())
Contributor

@xokdvium xokdvium Jul 3, 2025

Won't this throw a bad_optional_access when load-limit is not specified? It seems that the correct thing to do here is to use an implicit conversion to bool or loadLimit.has_value().

Sorry, I missed that this is a call to Setting::get.
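
For anyone else who trips over this, here is a self-contained sketch (drastically simplified types, not Nix's actual `Setting` implementation) of why the `if` is safe: `Setting<T>::get()` returns the stored value itself, so with `T = std::optional<unsigned>` the condition uses `std::optional`'s `bool` conversion rather than an access that could throw `bad_optional_access`.

```cpp
#include <iostream>
#include <optional>

// Drastically simplified stand-in for Nix's Setting<T>.
template<typename T>
struct Setting
{
    T value;
    const T & get() const { return value; }  // returns T itself; never throws
};

int main()
{
    Setting<std::optional<unsigned>> loadLimit{std::nullopt};
    if (loadLimit.get())                      // optional's bool conversion
        std::cout << "limit: " << *loadLimit.get() << "\n";
    else
        std::cout << "load-limit unset; nothing exported\n";  // this branch runs
}
```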

@philiptaron
Contributor

I'm really excited about this PR being merged now that #13402 is part of 2.31.0 -- I regularly bog my machine down and have to fiddle with low -j settings to get unwedged.

@chrisheib

chrisheib commented Sep 5, 2025

This is cool! In my observations (see my comment on Discourse), swapping is the true build killer. A heuristic should try to optimize RAM usage while preventing swapping at all costs, IMO.

See my current build: the top graph shows the CPU sitting mostly idle even though the build wants to run at full load, and the bottom graph shows the disk I/O due to swapping, which absolutely tanks performance. I suspect the system load average would signal 'full steam ahead' to a controller.

[Screenshot: CPU utilization (top, mostly idle) and disk I/O from swapping (bottom) during the build]
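
For context on what `-l` actually measures, here is an illustrative sketch of the underlying call, `getloadavg(3)`, which GNU Make samples for its load limit. Note that the load average counts runnable tasks (and, on Linux, also tasks blocked in uninterruptible I/O), so it is at best an indirect proxy for memory pressure.

```cpp
#include <cstdlib>   // getloadavg (glibc/BSD extension)
#include <iostream>

int main()
{
    // Fetch the 1-minute load average, the figure make -l compares against.
    double load[1];
    if (getloadavg(load, 1) == 1)
        std::cout << "1-minute load average: " << load[0] << "\n";
    else
        std::cout << "load average unavailable on this platform\n";
    return 0;
}
```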

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/i-o-cpu-scheduling-jobs-cores-and-performance-baby/66120/9

