-
Notifications
You must be signed in to change notification settings - Fork 34
Improve logging of rebalancer and recovery #586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Improve logging of rebalancer and recovery #586
Conversation
6b1057d to
64cc837
Compare
Serpentian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the comments for the first two commits, more comments are coming later) Thank you for working on this, good logging is crucial and allows us to investigate, what happened during incidents
5a8b3f8 to
f5c25f7
Compare
Serpentian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, shit. I forgot to send the last message of review, I'm very sorry
04c506f to
ccff54f
Compare
46add65 to
a1c095b
Compare
489b425 to
fce8f28
Compare
Serpentian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but this one is critical, overlooked it somehow on the prev iterations(
fce8f28 to
dcfa98a
Compare
vshard/storage/init.lua
Outdated
| {timeout = consts.REBALANCER_GET_STATE_TIMEOUT}) | ||
| if state == nil then | ||
| return | ||
| return nil, replicaset.id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be carefully rebased, now error is returned from the rebalancer_download_states. I'd propose to introduce the new error code, which will say, that the storage doesn't have all bucket in proper state, and set err there if it's nil (it'll happen, in case the function really returned nil).
In that err (or any other there) we'll just set the err.replicaset, smth like that:
Line 1653 in b4adaea
| err.replicaset = rs_id |
And we'll print the whole err, not just the replicaset's id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
dcfa98a to
d0922ba
Compare
Before this patch "Finish bucket recovery step ..." logs were printed at the end of recovery even if no buckets were successfully recovered. It led to unnecessary log records. This patch fixes the issue by adding an additional check for the number of recovered buckets. Part of tarantool#212 NO_DOC=bugfix
This patch introduces logging of buckets' ids which were recovered during recovery stage of storage. Part of tarantool#212 NO_DOC=bugfix
This patch adds rebalancer routes' logging. The log file now includes information about the source storage, the number of buckets, and the destination storage where the buckets will be moved. Since the rebalancer service has changed logging of routes that were sent, we change the `rebalancer/rebalancer.test.lua` and `rebalancer/stress_add_remove_several_rs.test.lua` tests. Part of tarantool#212 NO_DOC=bugfix
63843af to
bb8b109
Compare
vshard/storage/init.lua
Outdated
| {timeout = consts.REBALANCER_GET_STATE_TIMEOUT}) | ||
| if state == nil then | ||
| return nil, err | ||
| return lerror.vshard(lerror.code.BUCKETS_NOT_IN_PROPER_STATE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, this way we're losing the original error. If the result is nil and error is nil, then the function was executed properly, but it just have some buckets in the SENDING, RECEIVING or GARBAGE state. Only in that case the new error should be thrown. If e.g. timeout happened, this error should not be printed.
But let's go even further and make the rebalancer_request_state function consistent with others: so that it returns either result or nil,err: e.g. introduce the BUCKET_INVALID_STATE: Replica <id> has <state> buckets and return it from there, this way we'll also know, which buckets are still there.
And in that if statement we wanna just do err.replicaset = <replicaset_id> and that's it. We won't care, which error occured: e.g. timeout, NON_MASTER or BUCKET_INVALID_STATE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Before this patch the function `rebalancer_request_state` returned only nil in case of errors (e.g. presence of SENDING, RECEIVING, GARBAGE buckets, not active rebalancer). This makes it inconsistent compared to other rebalancer functions. Now, in case of errors we return `(nil, err)` instead of `nil`. It can help us to propagate a meaningful error in rebalancer service. NO_TEST=refactoring NO_DOC=refactoring
Before this patch the function `rebalancer_download_states` didn't return information about replicaset from which the states could not be downloaded. As a result, the log "Some buckets are not active ..." lacks of valuable information about unhealthy replicaset. Now, we add replicaset.id in error which is returned from `rebalancer_download_states`. It can help us to propagate replicaset.id to "Some buckets are not active ..." and "Error during downloading rebalancer states ..." log. Also we change `rebalancer/rebalancer.test.lua` test which expected the old "Some buckets are not active" log without replicaset.id. Closes tarantool#212 NO_DOC=bugfix
bb8b109 to
a3b9651
Compare
Before this patch "Finish bucket recovery step ..." logs were printed at
the end of recovery even if no buckets were successfully recovered, it led
to unnecessary log entries. This patch fixes the issue by adding an
additional check for the number of recovered buckets.
Closes #212
NO_DOC=bugfix