Commit 2eb17c4

router: backoff on storage being disabled
If a storage reports that it is disabled, it will probably take some time before it can accept new requests.

This patch makes the STORAGE_IS_DISABLED error put the connection into backoff, in line with the 'access denied' and 'no such function' errors. The reason for all 3 is the same: the storage is not ready to accept requests yet. Such requests are now transparently retried.

Closes #298

@TarantoolBot document
Title: vshard.storage.enable/disable()

`vshard.storage.disable()` makes most of the `vshard.storage` functions throw an error. The error is raised as a Lua exception, not returned via the `nil, err` pattern.

`vshard.storage.enable()` reverts the disable.

By default the storage is enabled.

Additionally, the storage is forcefully disabled automatically until `vshard.storage.cfg()` is finished and the instance has finished recovery (its `box.info.status` is `'running'`, for example).

Auto-disable protects from usage of vshard functions before the storage's global state is fully created.

Manual `vshard.storage.disable()` helps to achieve the same for the user's application. For instance, a user might want to do some preparatory work after `vshard.storage.cfg` before the application is ready for requests. Then the flow would be:

```Lua
vshard.storage.disable()
vshard.storage.cfg(...)
-- Do your preparatory work here ...
vshard.storage.enable()
```

The routers handle the errors signaling that the storage is disabled in a special way. They put connections to such instances into a backoff state for some time and try to use other replicas. For example, assume a replicaset has replicas 'replica_1' and 'replica_2', and 'replica_1' is disabled for any reason. If a router tries to talk to 'replica_1', it gets a special error and transparently retries the request on 'replica_2'. When 'replica_1' is enabled again, the router notices it and sends requests to it again.

It all works exclusively for read-only requests. Read-write requests can only be sent to a master, of which there is one per replicaset. They are not retried.
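The read-only retry flow described in the message can be modeled in a few lines of plain Lua. This is only a simplified sketch, not vshard's implementation: `make_replica`, `replica_call`, `callro`, the `name` field on the error, and the `BACKOFF_INTERVAL` value are all hypothetical names chosen for illustration, and time is passed in explicitly instead of being read from a clock.

```lua
-- Simplified model of router-side backoff; NOT vshard's actual code.
-- All names and the interval value below are illustrative assumptions.
local BACKOFF_INTERVAL = 5 -- seconds, hypothetical value

local function make_replica(name, enabled)
    return {name = name, enabled = enabled, backoff_ts = nil, calls = 0}
end

-- Simulates a storage call: a disabled storage reports a special error.
local function replica_call(replica, value)
    if not replica.enabled then
        return nil, {type = 'ShardingError', name = 'STORAGE_IS_DISABLED'}
    end
    replica.calls = replica.calls + 1
    return value
end

-- Read-only call: walks replicas in priority order, skips the ones in
-- backoff, and puts a replica into backoff when it reports being disabled.
local function callro(replicas, value, now)
    local last_err
    for _, replica in ipairs(replicas) do
        -- An expired backoff is dropped, so the replica is tried again.
        if replica.backoff_ts and
           now - replica.backoff_ts >= BACKOFF_INTERVAL then
            replica.backoff_ts = nil
        end
        if not replica.backoff_ts then
            local res, err = replica_call(replica, value)
            if res ~= nil then
                return res
            end
            last_err = err
            if err.name == 'STORAGE_IS_DISABLED' then
                replica.backoff_ts = now
            end
        end
    end
    return nil, last_err
end

local replica_1 = make_replica('replica_1', false)
local replica_2 = make_replica('replica_2', true)
local replicas = {replica_1, replica_2}

-- 'replica_1' is disabled: the request is transparently served by 'replica_2'.
local res = callro(replicas, 100, 0)
```

In this model the caller only sees the successful result, while 'replica_1' silently enters backoff. Once it is enabled again and the interval has passed, the next `callro` goes back to it, which mirrors the scenario the test in this commit checks.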
1 parent 723fcff commit 2eb17c4

3 files changed

+145
-2
lines changed

test/router/router2.result

Lines changed: 92 additions & 0 deletions
@@ -570,6 +570,98 @@ assert(not ok and err.message:match('Unknown mode') ~= nil)
  | - true
  | ...
 
+--
+-- Storage is disabled = backoff.
+--
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+vshard.storage.disable()
+ | ---
+ | ...
+
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+-- Drop old backoffs.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+ | ---
+ | ...
+-- Success, but internally the request was retried.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(res == 100)
+ | ---
+ | - true
+ | ...
+-- The best replica entered backoff state.
+util = require('util')
+ | ---
+ | ...
+storage_2 = vshard.router.static.replicasets[replicasets[2]]
+ | ---
+ | ...
+storage_2_a = storage_2.replicas[util.name_to_uuid.storage_2_a]
+ | ---
+ | ...
+assert(storage_2_a.backoff_ts ~= nil)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('storage_2_b')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 1)
+ | ---
+ | - true
+ | ...
+echo_count = 0
+ | ---
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 0)
+ | ---
+ | - true
+ | ...
+vshard.storage.enable()
+ | ---
+ | ...
+
+test_run:switch('router_1')
+ | ---
+ | - true
+ | ...
+-- Drop the backoff.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+ | ---
+ | ...
+-- Now goes to the best replica - it is enabled again.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+ | ---
+ | ...
+assert(res == 100)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('storage_2_a')
+ | ---
+ | - true
+ | ...
+assert(echo_count == 1)
+ | ---
+ | - true
+ | ...
+
 _ = test_run:switch("default")
  | ---
  | ...

test/router/router2.test.lua

Lines changed: 36 additions & 0 deletions
@@ -226,6 +226,42 @@ ok, err = rs:callro('vshard.storage.call', {1, 'badmode', 'echo', {100}},
                     long_timeout)
 assert(not ok and err.message:match('Unknown mode') ~= nil)
 
+--
+-- Storage is disabled = backoff.
+--
+test_run:switch('storage_2_a')
+vshard.storage.disable()
+
+test_run:switch('router_1')
+-- Drop old backoffs.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+-- Success, but internally the request was retried.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(res == 100)
+-- The best replica entered backoff state.
+util = require('util')
+storage_2 = vshard.router.static.replicasets[replicasets[2]]
+storage_2_a = storage_2.replicas[util.name_to_uuid.storage_2_a]
+assert(storage_2_a.backoff_ts ~= nil)
+
+test_run:switch('storage_2_b')
+assert(echo_count == 1)
+echo_count = 0
+
+test_run:switch('storage_2_a')
+assert(echo_count == 0)
+vshard.storage.enable()
+
+test_run:switch('router_1')
+-- Drop the backoff.
+fiber.sleep(vshard.consts.REPLICA_BACKOFF_INTERVAL)
+-- Now goes to the best replica - it is enabled again.
+res, err = vshard.router.callro(1, 'echo', {100}, long_timeout)
+assert(res == 100)
+
+test_run:switch('storage_2_a')
+assert(echo_count == 1)
+
 _ = test_run:switch("default")
 _ = test_run:cmd("stop server router_1")
 _ = test_run:cmd("cleanup server router_1")

vshard/replicaset.lua

Lines changed: 17 additions & 2 deletions
@@ -347,9 +347,21 @@ local function replica_call(replica, func, args, opts)
         if opts.timeout >= replica.net_timeout then
             replica_on_failed_request(replica)
         end
+        local err = storage_status
+        -- VShard functions can throw exceptions using error() function. When
+        -- it reaches the network layer, it is wrapped into LuajitError. Try to
+        -- extract the original error if this is the case. Not always is
+        -- possible - the string representation could be truncated.
+        --
+        -- In old Tarantool versions LuajitError turned into ClientError on the
+        -- client. Check both types.
+        if func:startswith('vshard.') and (err.type == 'LuajitError' or
+           err.type == 'ClientError') then
+            err = lerror.from_string(err.message) or err
+        end
         log.error("Exception during calling '%s' on '%s': %s", func, replica,
-                  storage_status)
-        return false, nil, lerror.make(storage_status)
+                  err)
+        return false, nil, lerror.make(err)
     else
         replica_on_success_request(replica)
     end
@@ -472,6 +484,9 @@ local function can_backoff_after_error(e, func)
             return e.message:startswith("Procedure 'vshard.")
         end
     end
+    if e.type == 'ShardingError' then
+        return e.code == vshard.error.code.STORAGE_IS_DISABLED
+    end
     return false
 end
 
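The backoff decision touched by the second hunk can be illustrated with a standalone plain-Lua sketch. It is a hedged approximation, not the code from `vshard/replicaset.lua`: the `CODE` table uses symbolic placeholder values instead of real `box.error` codes, `starts_with` stands in for Tarantool's `string.startswith` extension, and the 'access denied' message prefix and the `AccessDeniedError` type are assumptions that do not appear in the diff above.

```lua
-- Standalone approximation of the backoff decision; NOT the vshard source.
-- CODE values are symbolic placeholders, not real Tarantool error codes.
local CODE = {
    NO_SUCH_PROC = 'NO_SUCH_PROC',
    ACCESS_DENIED = 'ACCESS_DENIED',
    STORAGE_IS_DISABLED = 'STORAGE_IS_DISABLED',
}

-- Plain Lua has no string.startswith; this helper replaces it here.
local function starts_with(str, prefix)
    return str:sub(1, #prefix) == prefix
end

local function can_backoff(e, func)
    -- Only errors of vshard's own functions are considered.
    if type(e) ~= 'table' or not starts_with(func, 'vshard.') then
        return false
    end
    if e.type == 'ClientError' or e.type == 'AccessDeniedError' then
        if e.code == CODE.NO_SUCH_PROC then
            return starts_with(e.message, "Procedure 'vshard.")
        end
        if e.code == CODE.ACCESS_DENIED then
            -- Assumed message prefix, used only for illustration.
            return starts_with(e.message, "Execute access to function 'vshard.")
        end
    end
    -- The case added by this commit: the storage says it is disabled.
    if e.type == 'ShardingError' then
        return e.code == CODE.STORAGE_IS_DISABLED
    end
    return false
end

local disabled = {type = 'ShardingError', code = CODE.STORAGE_IS_DISABLED}
local no_proc = {type = 'ClientError', code = CODE.NO_SUCH_PROC,
                 message = "Procedure 'vshard.storage.call' is not defined"}
local other = {type = 'ShardingError', code = 'SOMETHING_ELSE'}
```

All three "storage is not ready yet" cases end up in the same branch of the router's logic, which is why the commit only needs three new lines in `can_backoff_after_error` plus the error unwrapping in `replica_call`.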
