Improve logging of rebalancer and recovery #586


Open

mrForza wants to merge 5 commits into tarantool:master from mrForza:mrforza/gh-212-improvement-of-rebalancer-logging

+189 −20

Contributor

mrForza commented Aug 12, 2025

Before this patch, "Finish bucket recovery step ..." logs were printed at
the end of recovery even if no buckets were successfully recovered, which
led to unnecessary log entries. This patch fixes the issue by adding an
additional check for the number of recovered buckets.

Closes #212

NO_DOC=bugfix

mrForza force-pushed the mrforza/gh-212-improvement-of-rebalancer-logging branch from 6b1057d to 64cc837 Compare

August 15, 2025 09:27

mrForza requested a review from Serpentian

August 15, 2025 09:42

mrForza assigned Serpentian

Serpentian reviewed

Collaborator

Serpentian left a comment

These are the comments for the first two commits; more comments are coming later. Thank you for working on this: good logging is crucial and lets us investigate what happened during incidents.

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

vshard/storage/init.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (resolved)

vshard/storage/init.lua (outdated, resolved)

Serpentian reviewed

test/storage-luatest/storage_1_1_1_test.lua (resolved)

vshard/storage/init.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

vshard/storage/init.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

Serpentian assigned mrForza and unassigned Serpentian

mrForza force-pushed the mrforza/gh-212-improvement-of-rebalancer-logging branch 3 times, most recently from 5a8b3f8 to f5c25f7 Compare

August 22, 2025 15:52

mrForza assigned Serpentian and unassigned mrForza

mrForza requested a review from Serpentian

August 23, 2025 13:17

Serpentian reviewed

Collaborator

Serpentian left a comment

Oh, shit. I forgot to send the last message of review, I'm very sorry

vshard/storage/init.lua (resolved)

Serpentian reviewed

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

vshard/storage/init.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

vshard/storage/init.lua (resolved)

vshard/storage/init.lua (outdated, resolved)

Serpentian assigned mrForza and unassigned Serpentian

mrForza force-pushed the mrforza/gh-212-improvement-of-rebalancer-logging branch 2 times, most recently from 04c506f to ccff54f Compare

September 10, 2025 07:47

mrForza assigned Serpentian and unassigned mrForza

mrForza requested a review from Serpentian

September 10, 2025 08:07

Serpentian requested changes

vshard/storage/init.lua (outdated, resolved)

vshard/storage/init.lua (outdated, resolved)

vshard/storage/init.lua (outdated, resolved)

test/storage-luatest/storage_1_1_1_test.lua (outdated, resolved)

vshard/storage/init.lua (resolved)

Serpentian assigned mrForza and unassigned Serpentian

mrForza force-pushed the mrforza/gh-212-improvement-of-rebalancer-logging branch 3 times, most recently from 46add65 to a1c095b Compare

September 17, 2025 13:22

mrForza assigned Serpentian and unassigned mrForza

Serpentian removed their assignment

mrForza force-pushed the mrforza/gh-212-improvement-of-rebalancer-logging branch from 489b425 to fce8f28 Compare

October 9, 2025 09:21

mrForza removed their assignment

mrForza requested a review from Serpentian

October 10, 2025 12:40

mrForza assigned Serpentian

Serpentian reviewed

Collaborator

Serpentian left a comment

Sorry, but this one is critical; I somehow overlooked it in the previous iterations.

vshard/storage/init.lua (outdated, resolved)

Serpentian assigned mrForza and unassigned Serpentian

mrForza force-pushed the mrforza/gh-212-improvement-of-rebalancer-logging branch from fce8f28 to dcfa98a Compare

October 14, 2025 14:18

mrForza requested a review from Serpentian

October 14, 2025 14:39

mrForza assigned Serpentian and unassigned mrForza

Serpentian reviewed

test/storage-luatest/storage_1_1_1_test.lua (resolved)

vshard/storage/init.lua Outdated

    
      {timeout = consts.REBALANCER_GET_STATE_TIMEOUT})
  if state == nil then
-     return
+     return nil, replicaset.id

Collaborator

Serpentian Oct 20, 2025

This should be rebased carefully: an error is now returned from rebalancer_download_states. I'd propose introducing a new error code that says the storage doesn't have all buckets in a proper state, and setting err to it if err is nil (which happens when the function really returned nil).

In that err (or any other one we get there) we'll just set err.replicaset, something like this:

vshard/vshard/router/init.lua

Line 1653 in b4adaea

err.replicaset = rs_id

And we'll print the whole err, not just the replicaset's id
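
The proposal above can be sketched in plain Lua roughly as follows. This is only an illustration, not the code from this PR: `download_states`, `request_state`, the error table layout, and the error name used for the bare-nil case are hypothetical stand-ins for the real vshard functions and error objects.

```lua
-- Hypothetical sketch: tag whatever error came back with the replicaset id,
-- so the caller can log the whole error object instead of just the id.
local function download_states(replicasets, request_state)
    local states = {}
    for rs_id, rs in pairs(replicasets) do
        local state, err = request_state(rs)
        if state == nil then
            -- Assumption: a bare nil means "buckets are not in a proper
            -- state", so an error is created only when none was returned.
            err = err or {name = 'BUCKETS_NOT_IN_PROPER_STATE',
                          message = 'Some buckets are not active'}
            err.replicaset = rs_id
            return nil, err
        end
        states[rs_id] = state
    end
    return states
end

-- Stubbed remote call: a replicaset marked with `fail` reports a timeout.
local function request_state(rs)
    if rs.fail then
        return nil, {name = 'TIMEOUT', message = 'Timeout exceeded'}
    end
    return {bucket_active_count = rs.count}
end

local states, err = download_states({rs_2 = {fail = true}}, request_state)
print(states, err.name, err.replicaset)
```

The original error (here a timeout) survives untouched; only the `replicaset` field is added on top of it.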

Contributor Author

mrForza Oct 31, 2025

fixed

Serpentian assigned mrForza and unassigned Serpentian

mrForza force-pushed the mrforza/gh-212-improvement-of-rebalancer-logging branch from dcfa98a to d0922ba Compare

October 31, 2025 12:45

mrForza added 3 commits

October 31, 2025 16:05


recovery: reduce spam of "Finish bucket recovery step" logs

0c2d5db

Before this patch "Finish bucket recovery step ..." logs were printed at
the end of recovery even if no buckets were successfully recovered. It led
to unnecessary log records. This patch fixes the issue by adding an
additional check for the number of recovered buckets.

Part of tarantool#212

NO_DOC=bugfix
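
The check described in this commit message could look roughly like the sketch below. It is a standalone illustration under stated assumptions: `finish_recovery_step`, the `log_info` callback, and the exact message text are hypothetical and are not taken from the vshard sources.

```lua
-- Hypothetical sketch of the guard; `log_info` stands in for a real logger
-- and the message format is illustrative.
local function finish_recovery_step(recovered, total, log_info)
    -- Stay silent when the step recovered nothing, so idle iterations of
    -- the recovery loop do not flood the log.
    if recovered > 0 then
        log_info(string.format(
            'Finish bucket recovery step, %d buckets are recovered among %d',
            recovered, total))
    end
end

-- Collect messages in a table instead of writing to a real log.
local messages = {}
local function log_info(msg)
    table.insert(messages, msg)
end

finish_recovery_step(0, 5, log_info) -- nothing recovered: no log record
finish_recovery_step(2, 5, log_info) -- two recovered: one log record
print(#messages, messages[1])
```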


recovery: add logging of recovered buckets

This patch introduces logging of the ids of buckets that were recovered
during the recovery stage of the storage.

Part of tarantool#212

NO_DOC=bugfix
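
One possible way to render the recovered bucket ids for such a log record is sketched below. The helper name `format_bucket_ids` and the output format are assumptions made for illustration; the PR may format the ids differently.

```lua
-- Hypothetical helper: render recovered bucket ids as a compact sorted list.
local function format_bucket_ids(ids)
    -- Copy first so the caller's table is not reordered by the sort.
    local sorted = {}
    for i, id in ipairs(ids) do
        sorted[i] = id
    end
    table.sort(sorted)
    return '[' .. table.concat(sorted, ', ') .. ']'
end

print('Recovered buckets: ' .. format_bucket_ids({7, 3, 12}))
```

Sorting keeps the record stable between runs, which makes it easier to match in tests and to compare across log files.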


rebalancer: add logging of routes

06c15b0

This patch adds rebalancer routes' logging. The log file now
includes information about the source storage, the number of
buckets, and the destination storage where the buckets will
be moved.

Since the rebalancer service now logs the sent routes differently, we
update the `rebalancer/rebalancer.test.lua` and
`rebalancer/stress_add_remove_several_rs.test.lua` tests accordingly.

Part of tarantool#212

NO_DOC=bugfix
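
A minimal sketch of route logging with the three pieces of information named above (source, bucket count, destination) is given below. It assumes the routes are stored as `routes[src_id][dst_id] = bucket_count`; that shape, the `format_routes` helper, and the line format are illustrative assumptions rather than the PR's actual code.

```lua
-- Assumed route shape: routes[src_id][dst_id] = bucket_count.
local function format_routes(routes)
    local lines = {}
    for src, dsts in pairs(routes) do
        for dst, count in pairs(dsts) do
            table.insert(lines, string.format(
                '%s -> %s: %d buckets', src, dst, count))
        end
    end
    -- pairs() has no defined iteration order, so sort for stable output.
    table.sort(lines)
    return lines
end

local routes = {storage_1 = {storage_2 = 10, storage_3 = 5}}
for _, line in ipairs(format_routes(routes)) do
    print(line)
end
```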

mrForza force-pushed the mrforza/gh-212-improvement-of-rebalancer-logging branch 2 times, most recently from 63843af to bb8b109 Compare

October 31, 2025 13:36

mrForza assigned Serpentian and unassigned mrForza

mrForza requested a review from Serpentian

October 31, 2025 13:54

Serpentian reviewed

vshard/storage/init.lua Outdated

    
                          {timeout = consts.REBALANCER_GET_STATE_TIMEOUT})

                      if state == nil then

                          return nil, err

                          return lerror.vshard(lerror.code.BUCKETS_NOT_IN_PROPER_STATE,

Collaborator

Serpentian Nov 6, 2025

Nope, this way we're losing the original error. If both the result and the error are nil, then the function executed properly, but the storage just has some buckets in the SENDING, RECEIVING or GARBAGE state. Only in that case should the new error be thrown. If e.g. a timeout happened, this error should not be printed.

But let's go even further and make the rebalancer_request_state function consistent with the others, so that it returns either result or nil, err. E.g. introduce BUCKET_INVALID_STATE: Replica <id> has <state> buckets and return it from there; this way we'll also know which buckets are still there.

And in that if statement we just want to do err.replicaset = <replicaset_id> and that's it. We won't care which error occurred: e.g. timeout, NON_MASTER or BUCKET_INVALID_STATE.
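
The "either result or nil, err" convention suggested here can be illustrated with the standalone sketch below. Everything in it is hypothetical: the function bodies, the `counts` argument, and the error table layout are simplified stand-ins, and only the error name and message pattern follow the wording of the comment above.

```lua
-- Hypothetical sketch: return either a state table or nil plus an error
-- table, never a bare nil.
local function request_state(replica_id, counts)
    for _, status in ipairs({'sending', 'receiving', 'garbage'}) do
        local n = counts[status] or 0
        if n > 0 then
            return nil, {
                name = 'BUCKET_INVALID_STATE',
                message = string.format('Replica %s has %d %s buckets',
                                        replica_id, n, status),
            }
        end
    end
    return {bucket_active_count = counts.active or 0}
end

-- Caller side: no branching on the error kind, just tag and propagate.
local function download_state(rs_id, replica_id, counts)
    local state, err = request_state(replica_id, counts)
    if state == nil then
        err.replicaset = rs_id
        return nil, err
    end
    return state
end

local _, err = download_state('rs_1', 'replica_a', {active = 10, sending = 2})
print(err.replicaset, err.message)
```

Because the callee always supplies an error object on failure, the caller never has to guess why it received nil.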

Contributor Author

mrForza Nov 7, 2025

fixed

Serpentian assigned mrForza and unassigned Serpentian

mrForza added 2 commits

November 7, 2025 19:50


rebalancer: refactoring of rebalancer_request_state

c7c8376

Before this patch the function `rebalancer_request_state` returned only
nil in case of errors (e.g. presence of SENDING, RECEIVING, GARBAGE
buckets, or an inactive rebalancer). This made it inconsistent with the
other rebalancer functions.

Now, in case of errors we return `(nil, err)` instead of `nil`. This
helps us propagate a meaningful error to the rebalancer service.

NO_TEST=refactoring
NO_DOC=refactoring


rebalancer: add replicaset.id in "Some buckets are not active" log

a3b9651

Before this patch the function `rebalancer_download_states` didn't
return information about the replicaset from which the states could not
be downloaded. As a result, the log "Some buckets are not active
..." lacked valuable information about the unhealthy replicaset.

Now, we add replicaset.id to the error which is returned from
`rebalancer_download_states`. This lets us propagate replicaset.id
to the "Some buckets are not active ..." and "Error during downloading
rebalancer states ..." logs.

Also we change the `rebalancer/rebalancer.test.lua` test, which expected
the old "Some buckets are not active" log without replicaset.id.

Closes tarantool#212

NO_DOC=bugfix

mrForza force-pushed the mrforza/gh-212-improvement-of-rebalancer-logging branch from bb8b109 to a3b9651 Compare

November 7, 2025 19:18
