-
Notifications
You must be signed in to change notification settings - Fork 1.1k
linkfile: use fops->create in dht_linkfile_create #4585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
linkfile: use fops->create in dht_linkfile_create #4585
Conversation
Can one of the admins verify this patch? |
1 similar comment
Can one of the admins verify this patch? |
During linkfile and datafile creations, the linkfile is kept opening.
And a newly created linkfile will not be deleted because its fd is opend and flag
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for sending the change. This is an important PR that fixes a critical issue; however, I have a comment.
fromvol->fops->mknod, loc, S_IFREG | DHT_LINKFILE_MODE, 0, | ||
0, dict); | ||
fromvol->fops->create, loc, local->flags, | ||
S_IFREG | DHT_LINKFILE_MODE, 0, local->fd, dict); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we use create instead of mknod, we need to ensure that a valid fd_t is provided, and that it is properly released afterward (i.e., the fd's reference count should drop to zero to trigger cleanup).
While a simple create call works with your change, there are several places where we call dht_linkfile_create—including in dht_rename (for creating the linkfile during rename), dht_lookup_everywhere, etc.—without a valid local->fd, since these are not fd-based operations.
I believe this patch will cause such operations to fail or even crash. So we need to make sure an fd is created temporarily in such cases and unref it properly after creating the linkfile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your careful review @rafikc30 .
Yes, we should make sure a valid fd is present when call fops->create, otherwise the mount process will get crashed since later opreations requires fd->lock
, like this:
I will handle these cases in next submit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rafikc30 I have seen special handling in posix_mknod() for handling races in dht. Should they be present in posix_create() also to be safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pranithk You are right, it is best to have the xattr key GF_FORCE_REPLACE_KEY honoured in posix_create as well. Aslo, I think we should tackle the case where a linkfile is created to an existing hardlink with linkfile. With the current implementation this will result in different inode number for gfid file and linkfile. So I think we should also implemnt this check in posix_create.
daec561
to
adb86e8
Compare
CLANG-FORMAT FAILURE: index 8da3b107f..a770f8af3 100644
--- a/xlators/cluster/dht/src/dht-common.h
+++ b/xlators/cluster/dht/src/dht-common.h
@@ -729,14 +729,13 @@ typedef struct dht_fd_ctx {
int32_t __tmp = -1; \
if (!local || !local->fd || !local->params) \
break; \
- __tmp = dict_get_int32_sizen(local->params, \
- "dht-unref-tmp-fd", \
+ __tmp = dict_get_int32_sizen(local->params, "dht-unref-tmp-fd", \
&__need_unref); \
if (!__tmp && __need_unref) { \
fd_unref(local->fd); \
dict_del(local->params, "dht-unref-tmp-fd"); \
} \
- } while(0)
+ } while (0)
dht_layout_t *
dht_layout_new(xlator_t *this, int cnt); |
local->fd = fd_create(loc->inode, frame->root->pid); | ||
local->flags |= O_CREAT; | ||
|
||
fd_bind(local->fd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually fd_bind is done after we successfully open the fd. In this case, I don't think we need to do fd_bind at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually fd_bind is done after we successfully open the fd. In this case, I don't think we need to do fd_bind at all.
Since we are creating a linkfile, upon successful creation, we call fd_bind to increment fd_count, as the file descriptor was opened during posix_create. This ensures that the linkfile is not considered stale by other processes. If the linkfile creation fails, the temporary file descriptor will be unreferenced in linkfile.linkfile_cbk. Since we are the only one holding a reference to this fd, it will be destroyed once we call fd_unref, and fd_count will be decremented accordingly.
Are there any potential negative side effects of using fd_bind in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually fd_bind is done after we successfully open the fd. In this case, I don't think we need to do fd_bind at all.
fd_bind is necessary to keep the newly creating linkfile from being misjudged and deleted according to my tests. During posix_unlink, there is no more indicator to check if this linkfile is being used other than fd_count, so we do need fd_bind to increase fd->inode->fd_count. Otherwise, this linkfile will still be deleted even if it was successfully opened. Is this right? @pranithk @rafikc30
xlators/cluster/dht/src/dht-common.c
Outdated
@@ -2003,6 +2003,7 @@ dht_lookup_linkfile_create_cbk(call_frame_t *frame, void *cooie, xlator_t *this, | |||
cached_subvol = local->cached_subvol; | |||
conf = this->private; | |||
|
|||
DHT_CHECK_AND_UNREF_TMP_FD(local); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just leave the fd on local->fd and later the dht_local_wipe should take care of unreffing it. I can't think of any issues keeping the fd in local. Did you see any issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just leave the fd on local->fd and later the dht_local_wipe should take care of unreffing it. I can't think of any issues keeping the fd in local. Did you see any issue?
Yes, I see EBADFD
errors in client_pre_xxx_v2
if I do not unref tmp fd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just leave the fd on local->fd and later the dht_local_wipe should take care of unreffing it. I can't think of any issues keeping the fd in local. Did you see any issue?
clearing local->fd in dht_local_wipe is more reasonable, I need some time to investigate the cause of EBADFD, maybe it's related to my testing methods.
adb86e8
to
66cec34
Compare
CLANG-FORMAT FAILURE: index 691886e31..5af622887 100644
--- a/xlators/cluster/dht/src/dht-common.c
+++ b/xlators/cluster/dht/src/dht-common.c
@@ -2427,16 +2427,18 @@ dht_lookup_everywhere_done(call_frame_t *frame, xlator_t *this)
"No Cached was found and "
"unlink on hashed was skipped"
" so performing now: %s. fd_count = %d",
- local->loc.path, local->skip_unlink.opend_fd_count);
+ local->loc.path,
+ local->skip_unlink.opend_fd_count);
if (local->skip_unlink.opend_fd_count == 0) {
FRAME_SU_DO(frame, dht_local_t);
STACK_WIND(frame, dht_lookup_unlink_stale_linkto_cbk,
hashed_subvol, hashed_subvol->fops->unlink,
&local->loc, 0, local->xattr_req);
} else {
- /* Skip linkfile deletion, this linkfile may be in used bacause its fd_count > 0 */
- DHT_STACK_UNWIND(lookup, frame, -1, ENOENT, NULL, NULL, NULL,
- NULL);
+ /* Skip linkfile deletion, this linkfile may be in used
+ * bacause its fd_count > 0 */
+ DHT_STACK_UNWIND(lookup, frame, -1, ENOENT, NULL, NULL,
+ NULL, NULL);
}
}
|
66cec34
to
a88fdd3
Compare
CLANG-FORMAT FAILURE: index 561f9168d..359d5445d 100644
--- a/xlators/cluster/dht/src/dht-common.c
+++ b/xlators/cluster/dht/src/dht-common.c
@@ -2437,8 +2437,8 @@ dht_lookup_everywhere_done(call_frame_t *frame, xlator_t *this)
} else {
/* Skip linkfile deletion, this linkfile may be in used
* because its fd_count > 0 */
- DHT_STACK_UNWIND(lookup, frame, -1, ENOENT, NULL, NULL, NULL,
- NULL);
+ DHT_STACK_UNWIND(lookup, frame, -1, ENOENT, NULL, NULL,
+ NULL, NULL);
}
}
|
A newly created linkfile will be deleted for the race, after this, the gfid between linkfile and datafile may get mismatched. And shards deletion process is also affected by such a mismatch, most shards in BRICK_PATH/.shard cannot be removed correctly and space cannot be released. To fix this race, replace mknod with create during linkfile creation, in this case, we can keep this linkfile opening during the whole linkfile and datafile creations. A linkfile will not be regarded as stale and deleted then, because its fd is opened and posix_unlink can handle it properly. In this patch, the linkfile and datafile creation process is as below: dht hashed_subvol cached_subvol --> | | | | | | | a.create tmp fd as local->fd | | | b.send fops->create | | |---- dht_linkfile_create ---->| | | | | | c.succeed to create linkfile,| | |<-- dht_linkfile_create_cbk --| | | | |--- corresponding fop for datafile, mknod, create, etc. --->| | | |<---------- datafile fop cbk ----------| | | | | | | d.stack unwind/destroy | | | e.clear dht_local_t in | | <-- | dht_local_wipe | | | | | A tmp local->fd will be created for non-fd-based fops to ensure safe access to fd or its member in subsequent procedures. Once this linkfile was successfully created, its fd_count will be increased in server cbk and passed to client if necessary. Then, between phase c and phase e, linkfile will not be deleted by mistake since later processes will know this linkfile is being used, instead of stale. After phase e, this linkfile links to a datafile, so it is valid. This change should also be applied during dht_lookup_everywhere_cbk. Fixes: gluster#4548 Signed-off-by: chenjinhao <[email protected]>
a88fdd3
to
c0bac6b
Compare
CLANG-FORMAT FAILURE: index 425d48614..9b48b6d41 100644
--- a/xlators/storage/posix/src/posix-entry-ops.c
+++ b/xlators/storage/posix/src/posix-entry-ops.c
@@ -2446,7 +2446,8 @@ real_op:
if (op_errno == EEXIST) {
if (dict_get_sizen(xdata, GF_FORCE_REPLACE_KEY)) {
dict_del_sizen(xdata, GF_FORCE_REPLACE_KEY);
- op_ret = posix_unlink_stale_linkto(frame, this, real_path, &op_errno, loc);
+ op_ret = posix_unlink_stale_linkto(frame, this, real_path,
+ &op_errno, loc);
if (op_ret == 0)
goto real_op; |
@chen1195585098 Hey, sorry for the delay in review from my side. We are still working on the aftermath of this bug in our prod. Mostly mid next week we will be done with it. After that I will get time to look into this PR. |
It is okay. May everything caused by this bug will be addressed well :) |
_fd = sys_open(real_path, _flags, mode); | ||
|
||
if (_fd == -1) { | ||
op_errno = errno; | ||
op_ret = -1; | ||
if (op_errno == EEXIST) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also handle the case for linkfile create request to a hardlink, similar to what we do in posix_mknod with posix_create_link_if_gfid_exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also handle the case for linkfile create request to a hardlink, similar to what we do in posix_mknod with posix_create_link_if_gfid_exists
Done
Signed-off-by: chenjinhao <[email protected]>
/run full-regression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you had a chance to run full regression with this patch? If not, I can also try running the regression tests on my machine.
Haven't run a full-regression test yet. |
@rafikc30 I’m currently running the tests on my VM, but it’s pretty slow. If you happen to have a spare machine with better performance, could we run them there instead? |
Since we replace linkfile creation call from posix_mknod to posix_create, we should also apply this patch, to ensure the value of trusted.pgfid.xxx is correct. Original patch: https://review.gluster.org/#/c/glusterfs/+/20875/ Signed-off-by: chenjinhao <[email protected]>
Make sure _fd that passed to posix_fdstat is always valid even if posix_create_link_if_gfid_exists is called. Signed-off-by: chenjinhao <[email protected]>
There are 2 more commits aim to fix issues that found during full-regression tests between devel branch and this patch branch. |
A newly created linkfile will be deleted for the race, after this, the gfid between linkfile and datafile may get mismatched. And shards deletion process is also affected by such a mismatch, most shards in BRICK_PATH/.shard cannot be removed correctly and space cannot be release.
To fix this race, replace mknod with create during linkfile creation, in this case, we can keep this linkfile opening during the whole linkfile and datafile creations.
A linkfile will not be regarded as stale and deleted then, because its fd is opened and posix_unlink can handle it properly.
Fixes: #4548