4.x: Stabilize AdvancedShardAwarenessIT #676
base: scylla-4.x
Conversation
Seems like somehow some logs leak between methods. I've got to fix that too.
Force-pushed 2c1d3b7 to e701f0d
Force-pushed e701f0d to 63e75ac
Force-pushed 63e75ac to fe07a8c
Looks green now.
public void should_initialize_all_channels(boolean reuseAddress) {
  int poolLocalSizeSetting = 4; // Will round up to 6 due to not being divisible by 3 shards
  int expectedChannelsPerNode = 6;
  String node1 = CCM_RULE.getCcmBridge().getNodeIpAddress(1);
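The rounding mentioned in the comment above can be illustrated with a small sketch. Note that `roundUpToMultiple` is a hypothetical helper written for illustration, not the driver's actual code; it only assumes that the per-node pool size is rounded up to the nearest multiple of the shard count so every shard gets the same number of channels.

```java
public class ShardPoolSizing {
    // Hypothetical helper: round the configured pool size up to the nearest
    // multiple of the shard count, as described in the test comment above.
    static int roundUpToMultiple(int poolLocalSize, int shardCount) {
        return ((poolLocalSize + shardCount - 1) / shardCount) * shardCount;
    }

    public static void main(String[] args) {
        // With 3 shards, a configured size of 4 rounds up to 6 channels per node.
        System.out.println(roundUpToMultiple(4, 3)); // prints 6
        System.out.println(roundUpToMultiple(6, 3)); // prints 6 (already divisible)
    }
}
```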
This is even better than what we discussed: instead of forcing the test to use some predefined IP prefix, you make the actual IP part of the regex pattern.
Yes. A hardcoded IP would also work, but it could collide with something eventually.
.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_LOW, 10000)
.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_HIGH, 60000)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 64)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, expectedChannelsPerNode)
I see you also reduced the number of channels to just 6. Is that intentional?
Yes. In should_initialize_all_channels I made it 6 because it does not matter that much; it's mainly a sanity check that the basic behavior works.
Here (should_see_mismatched_shard) it is reduced to 33: less than 64, but still enough to be sure that without advanced shard awareness connections have a pretty high chance of landing on the wrong shard several times.
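As a back-of-the-envelope check of that claim, assuming connections land on shards uniformly at random when advanced shard awareness is off, and assuming 3 shards per node for illustration (the shard count in this test is an assumption here):

```java
public class MismatchEstimate {
    // With random source ports, each connection lands on the intended shard
    // with probability 1/shardCount; the remainder are expected mismatches.
    static double expectedMismatches(int channels, int shardCount) {
        return channels * (1.0 - 1.0 / shardCount);
    }

    public static void main(String[] args) {
        // 33 channels, 3 shards: about 22 connections expected on a wrong shard,
        // so several mismatch log lines are all but guaranteed.
        System.out.println(expectedMismatches(33, 3));
    }
}
```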
@Bouncheck, I have a couple of questions:
Why is currently unclear. The first sighting seems to be a GitHub Actions run after pushing this.
Also currently unclear. It could be something on the server side. One common thread I see between the failing runs is that there are sessions that try to communicate with the cluster created for DefaultMetadataTabletMapIT, which is long gone. Those extra sessions and reconnections cause additional matches to appear in the logs, but they are unrelated to the adv. shard awareness test. They could also be making port collisions or timeouts slightly more likely.
Before merging this, let's evaluate #682.
@Bouncheck, please rebase it.
On recent Scylla versions this test started failing periodically. It looks like with newer Scylla the driver somehow hits a scenario where it successfully initializes a good portion of the connections, then all connection attempts to one of the nodes get rejected. It is accompanied by multiple errors like this:

```
19:38:41.582 [s0-admin-1] WARN c.d.o.d.i.core.pool.ChannelPool - [s0|/127.0.2.2:19042] Error while opening new channel
com.datastax.oss.driver.api.core.DriverTimeoutException: [s0|id: 0xfc42b7c7, L:/127.0.0.1:11854 - R:/127.0.2.2:19042] Protocol initialization request, step 1 (OPTIONS): timed out after 5000 ms
	at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.onTimeout(ChannelHandlerRequest.java:110)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:160)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
```

Increasing delays between reconnections or even increasing the test timeout (the largest value tried was 40 seconds) does not help with this situation. The node logs do not show anything raising suspicion, not even a WARN.

This change lowers the number of nodes to 1 (previously 2) and the number of expected channels per session to 33 (previously 66) in resource-heavy test methods. The number of sessions remains at 4.
The reconnection delays in `should_not_struggle_to_fill_pools` will now start at around 300 ms and should not rise above 3200 ms. This is the smallest tested set of changes that seems to resolve the issue. The test remains meaningful since `should_struggle_to_fill_pools` still displays considerably worse performance without adv. shard awareness.
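The stated delay bounds are consistent with a doubling backoff. A minimal sketch of such a schedule follows; the doubling rule and the cap are assumptions inferred from the stated 300 ms start and 3200 ms ceiling, not the driver's exact reconnection policy:

```java
public class BackoffSketch {
    // Doubling backoff: baseMs * 2^attempt, capped at maxMs.
    // Attempt numbers are kept small, so the left shift cannot overflow here.
    static long delayMs(int attempt, long baseMs, long maxMs) {
        return Math.min(baseMs << attempt, maxMs);
    }

    public static void main(String[] args) {
        // With base 300 ms and cap 3200 ms the schedule is:
        // 300, 600, 1200, 2400, 3200, 3200, ...
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(delayMs(attempt, 300, 3200));
        }
    }
}
```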
Force-pushed fe07a8c to 53f16d9
Rebased and reworked this PR. Updated the PR description to reflect the current changes.