Skip to content

Conversation

chen1195585098
Copy link
Contributor

A newly created linkfile will be deleted for the race, after this, the gfid between linkfile and datafile may get mismatched. And shards deletion process is also affected by such a mismatch, most shards in BRICK_PATH/.shard cannot be removed correctly and space cannot be release.

To fix this race, replace mknod with create during linkfile creation, in this case, we can keep this linkfile opening during the whole linkfile and datafile creations.

A linkfile will not be regarded as stale and deleted then, because its fd is opened and posix_unlink can handle it properly.

Fixes: #4548

@gluster-ant
Copy link
Collaborator

Can one of the admins verify this patch?

1 similar comment
@gluster-ant
Copy link
Collaborator

Can one of the admins verify this patch?

@chen1195585098
Copy link
Contributor Author

During linkfile and datafile creations, the linkfile is kept opening.

[root@chenjinhao-pc ~]# ps aux|grep glusterfsd
root       44498  0.0  0.0 1221832 23704 ?       SLsl 15:30   0:00 /usr/local/sbin/glusterfsd -s 192.168.122.211 --volfile-id issue.192.168.122.211.brick1 -p /var/run/gluster/vols/issue/192.168.122.211-brick1.pid -S /var/run/gluster/605c42f9050e3ce0.socket --brick-name /brick1 -l /var/log/glusterfs/bricks/brick1.log --xlator-option *-posix.glusterd-uuid=0f58ea27-f808-47b2-9e92-5aed82a3091f --process-name brick --brick-port 59028 --xlator-option issue-server.listen-port=59028
root       44514  0.0  0.0 1221540 23652 ?       SLsl 15:30   0:00 /usr/local/sbin/glusterfsd -s 192.168.122.211 --volfile-id issue.192.168.122.211.brick2 -p /var/run/gluster/vols/issue/192.168.122.211-brick2.pid -S /var/run/gluster/a152b698ffdd88a0.socket --brick-name /brick2 -l /var/log/glusterfs/bricks/brick2.log --xlator-option *-posix.glusterd-uuid=0f58ea27-f808-47b2-9e92-5aed82a3091f --process-name brick --brick-port 51591 --xlator-option issue-server.listen-port=51591

[root@chenjinhao-pc ~]# ll /proc/44498/fd|grep -w 9
l-wx------. 1 root root 64  6月18日 15:32 272 -> /brick1/9
lrwx------. 1 root root 64  6月18日 15:32 9 -> socket:[94888]
[root@chenjinhao-pc ~]# ll /proc/44514/fd|grep -w 9
l-wx------. 1 root root 64  6月18日 15:32 272 -> /brick2/9
lrwx------. 1 root root 64  6月18日 15:32 9 -> socket:[108826]

And a newly created linkfile will not be deleted because its fd is opend and flag open-fd-key-status is set accordingly. So, posix_unlink can handle this properly.

[root@chenjinhao-pc 1]# grep -rn posix_unlink /var/log/glusterfs/bricks/brick{1,2}.log|grep "2025-06-18"
/var/log/glusterfs/bricks/brick1.log:35587:[2025-06-18 07:33:36.604831 +0000] I [MSGID: 113030] [posix-entry-ops.c:1401:posix_unlink] 0-issue-posix: open-fd-key-status: 1 for /brick1/9 
/var/log/glusterfs/bricks/brick1.log:35591:The message "I [MSGID: 113030] [posix-entry-ops.c:1401:posix_unlink] 0-issue-posix: open-fd-key-status: 1 for /brick1/9" repeated 2 times between [2025-06-18 07:33:36.604831 +0000] and [2025-06-18 07:33:36.608125 +0000]
/var/log/glusterfs/bricks/brick2.log:34740:[2025-06-18 07:33:36.604900 +0000] I [MSGID: 113030] [posix-entry-ops.c:1401:posix_unlink] 0-issue-posix: open-fd-key-status: 1 for /brick2/9 
/var/log/glusterfs/bricks/brick2.log:34744:The message "I [MSGID: 113030] [posix-entry-ops.c:1401:posix_unlink] 0-issue-posix: open-fd-key-status: 1 for /brick2/9" repeated 2 times between [2025-06-18 07:33:36.604900 +0000] and [2025-06-18 07:33:36.608189 +0000]

@chen1195585098
Copy link
Contributor Author

chen1195585098 commented Jun 18, 2025

@pranithk @mohit84 Hi, can you have a look at this? Thanks.

@pranithk pranithk self-assigned this Jun 24, 2025
@rafikc30 rafikc30 self-requested a review July 7, 2025 09:51
Copy link
Member

@rafikc30 rafikc30 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sending the change. This is an important PR that fixes a critical issue; however, I have a comment.

fromvol->fops->mknod, loc, S_IFREG | DHT_LINKFILE_MODE, 0,
0, dict);
fromvol->fops->create, loc, local->flags,
S_IFREG | DHT_LINKFILE_MODE, 0, local->fd, dict);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we use create instead of mknod, we need to ensure that a valid fd_t is provided, and that it is properly released afterward (i.e., the fd's reference count should drop to zero to trigger cleanup).

While a simple create call works with your change, there are several places where we call dht_linkfile_create—including in dht_rename (for creating the linkfile during rename), dht_lookup_everywhere, etc.—without a valid local->fd, since these are not fd-based operations.

I believe this patch will cause such operations to fail or even crash. So we need to make sure an fd is created temporarily in such cases and unref it properly after creating the linkfile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your careful review @rafikc30 .
Yes, we should make sure a valid fd is present when call fops->create, otherwise the mount process will get crashed since later opreations requires fd->lock, like this:
image

I will handle these cases in next submit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafikc30 I have seen special handling in posix_mknod() for handling races in dht. Should they be present in posix_create() also to be safe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pranithk You are right, it is best to have the xattr key GF_FORCE_REPLACE_KEY honoured in posix_create as well. Aslo, I think we should tackle the case where a linkfile is created to an existing hardlink with linkfile. With the current implementation this will result in different inode number for gfid file and linkfile. So I think we should also implemnt this check in posix_create.

@chen1195585098 chen1195585098 force-pushed the fix_linkfile_gfid_mismatch branch from daec561 to adb86e8 Compare July 9, 2025 10:17
@gluster-ant
Copy link
Collaborator

CLANG-FORMAT FAILURE:
Before merging the patch, this diff needs to be considered for passing clang-format

index 8da3b107f..a770f8af3 100644
--- a/xlators/cluster/dht/src/dht-common.h
+++ b/xlators/cluster/dht/src/dht-common.h
@@ -729,14 +729,13 @@ typedef struct dht_fd_ctx {
         int32_t __tmp = -1;                                                    \
         if (!local || !local->fd || !local->params)                            \
             break;                                                             \
-        __tmp = dict_get_int32_sizen(local->params,                            \
-                                     "dht-unref-tmp-fd",                       \
+        __tmp = dict_get_int32_sizen(local->params, "dht-unref-tmp-fd",        \
                                      &__need_unref);                           \
         if (!__tmp && __need_unref) {                                          \
             fd_unref(local->fd);                                               \
             dict_del(local->params, "dht-unref-tmp-fd");                       \
         }                                                                      \
-    } while(0)
+    } while (0)
 
 dht_layout_t *
 dht_layout_new(xlator_t *this, int cnt);

local->fd = fd_create(loc->inode, frame->root->pid);
local->flags |= O_CREAT;

fd_bind(local->fd);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually fd_bind is done after we successfully open the fd. In this case, I don't think we need to do fd_bind at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually fd_bind is done after we successfully open the fd. In this case, I don't think we need to do fd_bind at all.

Since we are creating a linkfile, upon successful creation, we call fd_bind to increment fd_count, as the file descriptor was opened during posix_create. This ensures that the linkfile is not considered stale by other processes. If the linkfile creation fails, the temporary file descriptor will be unreferenced in linkfile.linkfile_cbk. Since we are the only one holding a reference to this fd, it will be destroyed once we call fd_unref, and fd_count will be decremented accordingly.

Are there any potential negative side effects of using fd_bind in this context?

Copy link
Contributor Author

@chen1195585098 chen1195585098 Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually fd_bind is done after we successfully open the fd. In this case, I don't think we need to do fd_bind at all.

fd_bind is necessary to keep the newly creating linkfile from being misjudged and deleted according to my tests. During posix_unlink, there is no more indicator to check if this linkfile is being used other than fd_count, so we do need fd_bind to increase fd->inode->fd_count. Otherwise, this linkfile will still be deleted even if it was successfully opened. Is this right? @pranithk @rafikc30

@@ -2003,6 +2003,7 @@ dht_lookup_linkfile_create_cbk(call_frame_t *frame, void *cooie, xlator_t *this,
cached_subvol = local->cached_subvol;
conf = this->private;

DHT_CHECK_AND_UNREF_TMP_FD(local);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just leave the fd on local->fd and later the dht_local_wipe should take care of unreffing it. I can't think of any issues keeping the fd in local. Did you see any issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just leave the fd on local->fd and later the dht_local_wipe should take care of unreffing it. I can't think of any issues keeping the fd in local. Did you see any issue?

Yes, I see EBADFD errors in client_pre_xxx_v2 if I do not unref tmp fd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just leave the fd on local->fd and later the dht_local_wipe should take care of unreffing it. I can't think of any issues keeping the fd in local. Did you see any issue?

clearing local->fd in dht_local_wipe is more reasonable, I need some time to investigate the cause of EBADFD, maybe it's related to my testing methods.

@chen1195585098 chen1195585098 force-pushed the fix_linkfile_gfid_mismatch branch from adb86e8 to 66cec34 Compare July 12, 2025 04:10
@gluster-ant
Copy link
Collaborator

CLANG-FORMAT FAILURE:
Before merging the patch, this diff needs to be considered for passing clang-format

index 691886e31..5af622887 100644
--- a/xlators/cluster/dht/src/dht-common.c
+++ b/xlators/cluster/dht/src/dht-common.c
@@ -2427,16 +2427,18 @@ dht_lookup_everywhere_done(call_frame_t *frame, xlator_t *this)
                              "No Cached was found and "
                              "unlink on hashed was skipped"
                              " so performing now: %s. fd_count = %d",
-                             local->loc.path, local->skip_unlink.opend_fd_count);
+                             local->loc.path,
+                             local->skip_unlink.opend_fd_count);
                 if (local->skip_unlink.opend_fd_count == 0) {
                     FRAME_SU_DO(frame, dht_local_t);
                     STACK_WIND(frame, dht_lookup_unlink_stale_linkto_cbk,
                                hashed_subvol, hashed_subvol->fops->unlink,
                                &local->loc, 0, local->xattr_req);
                 } else {
-                    /* Skip linkfile deletion, this linkfile may be in used bacause its fd_count > 0 */
-                    DHT_STACK_UNWIND(lookup, frame, -1, ENOENT, NULL, NULL, NULL,
-                                     NULL);
+                    /* Skip linkfile deletion, this linkfile may be in used
+                     * bacause its fd_count > 0 */
+                    DHT_STACK_UNWIND(lookup, frame, -1, ENOENT, NULL, NULL,
+                                     NULL, NULL);
                 }
             }
 

@chen1195585098 chen1195585098 force-pushed the fix_linkfile_gfid_mismatch branch from 66cec34 to a88fdd3 Compare July 13, 2025 09:18
@gluster-ant
Copy link
Collaborator

CLANG-FORMAT FAILURE:
Before merging the patch, this diff needs to be considered for passing clang-format

index 561f9168d..359d5445d 100644
--- a/xlators/cluster/dht/src/dht-common.c
+++ b/xlators/cluster/dht/src/dht-common.c
@@ -2437,8 +2437,8 @@ dht_lookup_everywhere_done(call_frame_t *frame, xlator_t *this)
                 } else {
                     /* Skip linkfile deletion, this linkfile may be in used
                      * because its fd_count > 0 */
-                    DHT_STACK_UNWIND(lookup, frame, -1, ENOENT, NULL, NULL, NULL,
-                                     NULL);
+                    DHT_STACK_UNWIND(lookup, frame, -1, ENOENT, NULL, NULL,
+                                     NULL, NULL);
                 }
             }
 

A newly created linkfile will be deleted for the race, after
this, the gfid between linkfile and datafile may get
mismatched. And shards deletion process is also affected
by such a mismatch, most shards in BRICK_PATH/.shard
cannot be removed correctly and space cannot be released.

To fix this race, replace mknod with create during linkfile
creation, in this case, we can keep this linkfile opening
during the whole linkfile and datafile creations.

A linkfile will not be regarded as stale and deleted then,
because its fd is opened and posix_unlink can handle it
properly.

In this patch, the linkfile and datafile creation process is as below:

    dht                       hashed_subvol                cached_subvol
--> |                              |                              |
    |                              |                              |
    | a.create tmp fd as local->fd |                              |
    | b.send fops->create          |                              |
    |---- dht_linkfile_create ---->|                              |
    |                              |                              |
    | c.succeed to create linkfile,|                              |
    |<-- dht_linkfile_create_cbk --|                              |
    |                                                             |
    |--- corresponding fop for datafile, mknod, create, etc.  --->|
    |                                                             |
    |<----------            datafile fop cbk            ----------|
    |                                                             |
    |                              |                              |
    | d.stack unwind/destroy       |                              |
    | e.clear dht_local_t in       |                              |
<-- |   dht_local_wipe             |                              |
    |                              |                              |

A tmp local->fd will be created for non-fd-based fops to ensure
safe access to fd or its member in subsequent procedures. Once this
linkfile was successfully created, its fd_count will be increased in
server cbk and passed to client if necessary. Then, between phase c
and phase e, linkfile will not be deleted by mistake since later
processes will know this linkfile is being used, instead of stale.
After phase e, this linkfile links to a datafile, so it is valid.

This change should also be applied during dht_lookup_everywhere_cbk.

Fixes: gluster#4548
Signed-off-by: chenjinhao <[email protected]>
@chen1195585098 chen1195585098 force-pushed the fix_linkfile_gfid_mismatch branch from a88fdd3 to c0bac6b Compare July 13, 2025 10:01
@gluster-ant
Copy link
Collaborator

CLANG-FORMAT FAILURE:
Before merging the patch, this diff needs to be considered for passing clang-format

index 425d48614..9b48b6d41 100644
--- a/xlators/storage/posix/src/posix-entry-ops.c
+++ b/xlators/storage/posix/src/posix-entry-ops.c
@@ -2446,7 +2446,8 @@ real_op:
         if (op_errno == EEXIST) {
             if (dict_get_sizen(xdata, GF_FORCE_REPLACE_KEY)) {
                 dict_del_sizen(xdata, GF_FORCE_REPLACE_KEY);
-                op_ret = posix_unlink_stale_linkto(frame, this, real_path, &op_errno, loc);
+                op_ret = posix_unlink_stale_linkto(frame, this, real_path,
+                                                   &op_errno, loc);
 
                 if (op_ret == 0)
                     goto real_op;

@pranithk
Copy link
Member

@chen1195585098 Hey, sorry for the delay in review from my side. We are still working on the aftermath of this bug in our prod. Mostly mid next week we will be done with it. After that I will get time to look into this PR.

@chen1195585098
Copy link
Contributor Author

@chen1195585098 Hey, sorry for the delay in review from my side. We are still working on the aftermath of this bug in our prod. Mostly mid next week we will be done with it. After that I will get time to look into this PR.

It is okay. May everything caused by this bug will be addressed well :)

@chen1195585098
Copy link
Contributor Author

Hi @pranithk @pranithk , hope this message finds you all well. Is everything going as expected?

This patch has been pending for a while now, and it’s pretty crucial for our production environment. If you could review this patch when you get a chance, that'd be awesome.

Thanks a lot!

_fd = sys_open(real_path, _flags, mode);

if (_fd == -1) {
op_errno = errno;
op_ret = -1;
if (op_errno == EEXIST) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also handle the case for linkfile create request to a hardlink, similar to what we do in posix_mknod with posix_create_link_if_gfid_exists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also handle the case for linkfile create request to a hardlink, similar to what we do in posix_mknod with posix_create_link_if_gfid_exists

Done

@chen1195585098 chen1195585098 requested a review from rafikc30 August 6, 2025 08:44
@rafikc30
Copy link
Member

/run full-regression

rafikc30
rafikc30 previously approved these changes Aug 19, 2025
Copy link
Member

@rafikc30 rafikc30 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you had a chance to run full regression with this patch? If not, I can also try running the regression tests on my machine.

@chen1195585098
Copy link
Contributor Author

Did you had a chance to run full regression with this patch? If not, I can also try running the regression tests on my machine.

Haven't run a full-regression test yet.

@chen1195585098
Copy link
Contributor Author

Did you had a chance to run full regression with this patch? If not, I can also try running the regression tests on my machine.

@rafikc30 I’m currently running the tests on my VM, but it’s pretty slow. If you happen to have a spare machine with better performance, could we run them there instead?

chenjinhao added 2 commits August 23, 2025 09:12
Since we replace linkfile creation call from posix_mknod to
posix_create, we should also apply this patch, to ensure
the value of trusted.pgfid.xxx is correct.

Original patch:
https://review.gluster.org/#/c/glusterfs/+/20875/

Signed-off-by: chenjinhao <[email protected]>
Make sure _fd that passed to posix_fdstat is always valid
even if posix_create_link_if_gfid_exists is called.

Signed-off-by: chenjinhao <[email protected]>
@chen1195585098
Copy link
Contributor Author

chen1195585098 commented Aug 23, 2025

There are 2 more commits aim to fix issues that found during full-regression tests between devel branch and this patch branch.

Tesing with this latest patch, now, there is no new full-regression test failure compared with devel branch.

So, can you review this pr again when you get a time. Thanks :) @pranithk @rafikc30 @amarts

TEST RESULT:
There are 840 tests in total, except for some bad tests that are skiped, 162 tests get failure on my test vm finally in both branches. And now here is no difference between the tests result of two branches.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

concurrent creates when min-free-disk is exceeded can lead to gfid mismatches between linkto and data files
4 participants