Skip to content

Conversation

robn
Copy link
Member

@robn robn commented Sep 15, 2025

[Sponsors: Klara, Inc., Fastmail Pty Ltd]

Motivation and Context

This PR came out of customer investigation into the memory reclaim issues in 2.3.0 that were eventually identified as a regression and resolved by #17487.

That change is undoubtedly correct, ie, when the kernel comes asking for memory, we need to give back what we can. We were tackling it from the other side, trying to release things earlier if both the kernel and OpenZFS agree that they won't be needed again.

This PR adds the scaffolding and a pair of tunables for the kernel's inode and dentry reclaim queries. In both cases, the kernel asks the filesystem "I don't need these any more, should I free them or cache them?". We didn't implement these queries at all at all; this PR adds them, with tunables to control simple "always" and "never" responses, with the defaults set to "always" to keep the traditional behaviour.

The intent at this point is to have a couple of extra hooks available for debugging related issues at customer sites (with somewhat finer control that echo 3 > drop_caches, the brute force way to achieve same), but also some shapes to allow experimentation with cache control policy (eg #17159).

Description

  • Implement super_operations.drop_inode(), called by the kernel when it wants to drop the last inode reference. The return decides if it should destroy the inode immediately, or put it on the s_inodes list in the superblock. The tunable zfs_delete_inode decides if the kernel's generic_delete_inode() or generic_drop_inode() policy functions are called, which signal either an immediate destroy, or (default) cache if the file still exists on disk.

  • Implement dentry_operations.d_delete(), called by the kernel when it wants to drop the last dentry reference. The return decides if it should destroy the dentry immediately, or return it to the dentry cache. The zfs_delete_dentry tunable directly decides this; the default of 0 returns it to the cache. Worth noting that while we don't interact directly with the dentry cache, every active dentry holds an inode reference, which does pin the inode as above.

How Has This Been Tested?

Compiled back to 4.19, ZTS run on 6.1.x completed.

I ran both of these "on" (destroy everything immediately) as a daily driver for a month. Mostly unnoticeable on my under-loaded laptop, but at least, nothing crashed or broke.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@robn
Copy link
Member Author

robn commented Sep 15, 2025

Ahh, doesn't compile on 6.17, because of the change to d_op. It'll be easy; I'll sort it tomorrow.

@amotin
Copy link
Member

amotin commented Sep 15, 2025

I don't know what is the kernel's motivation to give this flexibility to the file system, but I doubt it was just to make user configure it. I won't object too hard, but this is not a kind of tunables I prefer. We should try to be more clever.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the number of performance issues we've seen over the years related to how the ARC+DBUF caches can interact with the kernel's inode and dentry caches I'm okay adding these tunables. They really shouldn't be needed, but I'm sure having them available with help us experiment with better tuning the behavior.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Sep 15, 2025
@youzhongyang
Copy link
Contributor

It's nice to have the flexibility of controlling the release of inode and dentry data in memory.

I can throw heavy load to test this PR once the compiling error in 6.17 is fixed.

@robn robn force-pushed the inode-dentry-reclaim branch from bd5cce2 to 3c4baf6 Compare September 15, 2025 23:41
@robn
Copy link
Member Author

robn commented Sep 15, 2025

Last push addresses the nits, and adds the support for 6.17 (replaces setting sb->s_d_op directly with a utility function set_default_d_op(sb, dop)).

@robn
Copy link
Member Author

robn commented Sep 15, 2025

I don't know what is the kernel's motivation to give this flexibility to the file system, but I doubt it was just to make user configure it.

@amotin the motivation is that it lets the filesystem set its own caching policy and/or do all the caching itself, because it has a better idea of the cost of reinflating an inode/dentry vs holding it in memory.

We should try to be more clever.

Agreed; I would prefer the ARC understand and manage this too; possibly through the dbuf cache, not sure (I have ideas of course). For now, this was intended as the initial hook.

I won't object too hard, but this is not a kind of tunables I prefer.

Yeah, I agree. My position on tunables is that they shouldn't exist at all, as in, there should be more sensible and cross-platform ways to configure the system, including self-tuning. But for now they're the thing we have.

These I definitely don't want to see used as a general tuning mechanism though; I don't want to have to support them. I wonder if they'd be better named with a debug_ prefix (or some other thing to make it clear you're doing something weird). Happy to change for this PR, if anyone has a strong opinion?

@behlendorf
Copy link
Contributor

or some other thing to make it clear you're doing something weird

Skimming the zfs.4 man page we added the phrase "Intended for testing/debugging." to the zfs_scan_suspend_progress entry which I kind of like. I wouldn't be against adding something like this to the docs.

Traditionally, unused inodes would be held on the superblock inode cache
until the associated on-disk file is removed or the kernel requests
reclaim.  On filesystems with millions of rarely-used files, this can be
a lot of unusable memory.

Here we implement the superblock drop_inode method, and add a
zfs_delete_inode tunable to control its behaviour. By default it
continues the traditional behaviour, but when the tunable is enabled, we
signal that the inode should be deleted immediately when the last
reference is dropped, rather than cached. This releases the associated
data to the dbuf cache and ARC, allowing them to be reclaimed normally.

Sponsored-by: Klara, Inc.
Sponsored-by: Fastmail Pty Ltd
Signed-off-by: Rob Norris <[email protected]>
Traditionally, unused dentries would be cached in the dentry cache until
the associated entry is no longer on disk. The cached dentry continues
to hold an inode reference, causing the inode to be pinned (see previous
commit).

Here we implement the dentry op d_delete, which is roughly analogous to
the drop_inode superblock op, and add a zfs_delete_dentry tunable to
control its behaviour. By default it continues the traditional
behaviour, but when the tunable is enabled, we signal that an unused
dentry should be freed immediately, releasing its inode reference, and
so allowing that inode to be deleted if no longer in use.

Sponsored-by: Klara, Inc.
Sponsored-by: Fastmail Pty Ltd
Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the inode-dentry-reclaim branch from 3c4baf6 to 6f150e1 Compare September 16, 2025 00:43
@robn
Copy link
Member Author

robn commented Sep 16, 2025

Skimming the zfs.4 man page we added the phrase "Intended for testing/debugging." to the zfs_scan_suspend_progress entry which I kind of like. I wouldn't be against adding something like this to the docs.

@behlendorf this is good, done!

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, if you see it useful for anything.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 17, 2025
behlendorf pushed a commit that referenced this pull request Sep 17, 2025
Traditionally, unused dentries would be cached in the dentry cache until
the associated entry is no longer on disk. The cached dentry continues
to hold an inode reference, causing the inode to be pinned (see previous
commit).

Here we implement the dentry op d_delete, which is roughly analogous to
the drop_inode superblock op, and add a zfs_delete_dentry tunable to
control its behaviour. By default it continues the traditional
behaviour, but when the tunable is enabled, we signal that an unused
dentry should be freed immediately, releasing its inode reference, and
so allowing that inode to be deleted if no longer in use.

Sponsored-by: Klara, Inc.
Sponsored-by: Fastmail Pty Ltd
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17746
@youzhongyang
Copy link
Contributor

arc-graph cpu-graph nfs-smb-ops net-pps

I tested this PR with relatively heavy load (nfs clients and win64 clients), see the above ARC graph and CPU usage graph. Here is the conclusion:

  • With zfs_delete_inode and zfs_delete_dentry set to 1, the system uses less ARC, approximately 1/3 of when setting to 0.
  • However, aggressively removing inodes and dentries from the memory leads to more system CPU time, and thus less I/O throughput.

behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Sep 17, 2025
Traditionally, unused inodes would be held on the superblock inode cache
until the associated on-disk file is removed or the kernel requests
reclaim.  On filesystems with millions of rarely-used files, this can be
a lot of unusable memory.

Here we implement the superblock drop_inode method, and add a
zfs_delete_inode tunable to control its behaviour. By default it
continues the traditional behaviour, but when the tunable is enabled, we
signal that the inode should be deleted immediately when the last
reference is dropped, rather than cached. This releases the associated
data to the dbuf cache and ARC, allowing them to be reclaimed normally.

Sponsored-by: Klara, Inc.
Sponsored-by: Fastmail Pty Ltd
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17746
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Sep 17, 2025
Traditionally, unused dentries would be cached in the dentry cache until
the associated entry is no longer on disk. The cached dentry continues
to hold an inode reference, causing the inode to be pinned (see previous
commit).

Here we implement the dentry op d_delete, which is roughly analogous to
the drop_inode superblock op, and add a zfs_delete_dentry tunable to
control its behaviour. By default it continues the traditional
behaviour, but when the tunable is enabled, we signal that an unused
dentry should be freed immediately, releasing its inode reference, and
so allowing that inode to be deleted if no longer in use.

Sponsored-by: Klara, Inc.
Sponsored-by: Fastmail Pty Ltd
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17746
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants