
Conversation


@Bouncheck Bouncheck commented Oct 2, 2025

On recent Scylla versions this test started failing periodically.
It looks like with newer Scylla the driver somehow hits a scenario where
it successfully initializes a good portion of the connections, then
all connection attempts to one of the nodes get rejected.
It is accompanied by multiple errors like this:

```
19:38:41.582 [s0-admin-1] WARN  c.d.o.d.i.core.pool.ChannelPool - [s0|/127.0.2.2:19042]  Error while opening new channel
com.datastax.oss.driver.api.core.DriverTimeoutException: [s0|id: 0xfc42b7c7, L:/127.0.0.1:11854 - R:/127.0.2.2:19042] Protocol initialization request, step 1 (OPTIONS): timed out after 5000 ms
	at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.onTimeout(ChannelHandlerRequest.java:110)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:160)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
```

Increasing the delays between reconnections, or even increasing the test timeout
(the largest value tried was 40 seconds), does not help with this situation.
The node logs do not show anything suspicious, not even a WARN.

This change lowers the number of nodes to 1 (previously 2) and the number
of expected channels per session to 33 (previously 66) in resource-heavy
test methods. The number of sessions remains at 4.
The reconnection delays in `should_not_struggle_to_fill_pools` will now
start at around 300 ms and should not rise above 3200 ms.
This is the smallest tested set of changes that seems to resolve the issue.
The test remains meaningful since `should_struggle_to_fill_pools` still
shows considerably worse performance without advanced shard awareness.
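For reference, a minimal sketch of how such a bound on reconnection delays could be expressed with the driver's programmatic config (an assumption for illustration; the exact options the test sets are in the diff, not shown here):

```java
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import java.time.Duration;

class ReconnectionDelaySketch {
  // Hedged sketch: keep the exponential reconnection policy between ~300 ms and 3200 ms,
  // the range quoted in the description above.
  static DriverConfigLoader boundedReconnection() {
    return DriverConfigLoader.programmaticBuilder()
        .withDuration(DefaultDriverOption.RECONNECTION_BASE_DELAY, Duration.ofMillis(300))
        .withDuration(DefaultDriverOption.RECONNECTION_MAX_DELAY, Duration.ofMillis(3200))
        .build();
  }
}
```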

@Bouncheck Bouncheck self-assigned this Oct 2, 2025
@Bouncheck Bouncheck marked this pull request as draft October 2, 2025 02:10
@Bouncheck Bouncheck (Author) commented Oct 2, 2025

It seems that some logs leak between test methods; I have to fix that too.
Also, the channel pool can sometimes be null when the await's condition is checked.
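A minimal sketch of a null-safe await condition along these lines (the pool-size supplier is a placeholder, not the test's actual accessor):

```java
import static org.awaitility.Awaitility.await;

import java.time.Duration;
import java.util.function.Supplier;

class PoolAwaitSketch {
  // Hedged sketch: poll a pool-size supplier that may return null before the pool exists,
  // instead of dereferencing the pool directly in the condition.
  static void awaitChannels(Supplier<Integer> channelCount, int expected) {
    await()
        .atMost(Duration.ofSeconds(30))
        .until(() -> {
          Integer channels = channelCount.get(); // null while the pool is not created yet
          return channels != null && channels == expected;
        });
  }
}
```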

@Bouncheck Bouncheck (Author) commented Oct 2, 2025

Some of the reconnections come from RandomTokenVnodesIT, which should not be running at all: from what I can see it has no annotation for the Scylla backend and also carries a ScyllaSkip annotation.
This was incorrect; the reconnections are for the cluster created for the next test, which is DefaultMetadataTabletMapIT.
RandomTokenVnodesIT started taking a few seconds longer even though it is marked for skipping, so there was a slight change there too, but that does not make it the suspect here yet.

@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from 2c1d3b7 to e701f0d on October 2, 2025 13:01
@nikagra nikagra self-requested a review October 2, 2025 13:06
@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from e701f0d to 63e75ac on October 2, 2025 13:44
@Bouncheck Bouncheck changed the title from "Stabilize AdvancedShardAwarenessIT" to "4.x: Stabilize AdvancedShardAwarenessIT" Oct 2, 2025
@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from 63e75ac to fe07a8c on October 2, 2025 14:05
@nikagra nikagra requested a review from dkropachev October 2, 2025 14:29
@Bouncheck Bouncheck marked this pull request as ready for review October 2, 2025 14:32
@Bouncheck Bouncheck (Author) commented

Looks green now.
Created #678 to follow up on root cause.

```java
public void should_initialize_all_channels(boolean reuseAddress) {
  int poolLocalSizeSetting = 4; // Will round up to 6 due to not being divisible by 3 shards
  int expectedChannelsPerNode = 6;
  String node1 = CCM_RULE.getCcmBridge().getNodeIpAddress(1);
```
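As a quick illustration of the rounding mentioned in the comment above (assuming the configured local pool size is rounded up to the nearest multiple of the node's shard count):

```java
class PoolRoundingSketch {
  // Round the configured pool size up to the nearest multiple of the shard count.
  static int roundedPoolSize(int configured, int shards) {
    return ((configured + shards - 1) / shards) * shards;
  }

  public static void main(String[] args) {
    System.out.println(roundedPoolSize(4, 3)); // prints 6, matching expectedChannelsPerNode
  }
}
```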

This is even better than what we discussed: you do not force the test to use some predefined IP prefix, but make the actual IP part of the regex pattern.

@Bouncheck Bouncheck (Author) commented

Yes. A hardcoded IP would also work, but it could eventually collide with something.
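A minimal sketch of the idea discussed here, building the expected-log pattern from the node's actual IP (the surrounding log text is a placeholder, not the test's real pattern):

```java
import java.util.regex.Pattern;

class LogPatternSketch {
  // Embed the node's actual address in the regex instead of hardcoding an IP,
  // so the pattern cannot collide with addresses used by other tests.
  static Pattern connectionLogPattern(String nodeIp) {
    return Pattern.compile(".*Connecting to " + Pattern.quote(nodeIp) + ":\\d+.*");
  }
}
```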

```java
.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_LOW, 10000)
.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_HIGH, 60000)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 64)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, expectedChannelsPerNode)
```

I see you also reduced the number of channels to just 6. Is that intentional?

@Bouncheck Bouncheck (Author) commented Oct 2, 2025

Yes. In `should_initialize_all_channels` I made it 6 because it does not matter that much there; it's mainly a sanity check that the basic behaviour works.

Here (`should_see_mismatched_shard`) it is reduced to 33. That is less than 64 but still enough to be confident that, without advanced shard awareness, connections have a pretty high chance of landing on the wrong shard several times.
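Rough intuition for the choice of 33, assuming that without advanced shard awareness each new connection lands on a uniformly random shard of a 3-shard node:

```java
class ShardMissSketch {
  public static void main(String[] args) {
    int shards = 3;
    int channels = 33;
    // Expected number of connections that initially land on the wrong shard.
    double expectedMisses = channels * (1.0 - 1.0 / shards); // = 22.0
    System.out.println(expectedMisses);
  }
}
```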

@dkropachev

@Bouncheck, I have a couple of questions:

  1. Why and when exactly did it start failing? It was not like that two weeks ago.
  2. Why is it not failing on 6.1.5 but failing on other Scylla versions? What is the difference?

@Bouncheck Bouncheck (Author) commented Oct 2, 2025

> @Bouncheck, I have a couple of questions:
>
> 1. Why and when exactly did it start failing? It was not like that two weeks ago.
Why is currently unclear. The first sighting seems to be the GitHub Actions run after pushing this commit:
ab2665f
However, the final run visible under the PR does not show the same failures:
https://github.com/scylladb/java-driver/actions/runs/17986407382
I don't see anything significant in this commit that would explain the failures right now.

> 2. Why is it not failing on `6.1.5` but failing on other Scylla versions? What is the difference?

Also currently unclear. It could be something on the server side. One common thread I see between the failing runs is that there are sessions trying to communicate with the cluster created for DefaultMetadataTabletMapIT, which is long gone.
It could be a mix of something incorrect in the test code that surfaced due to some change on the server side.

Those extra sessions and reconnections cause additional matches to appear in the logs, but they are unrelated to the advanced shard awareness test. They could also make port collisions or timeouts slightly more likely.

@Bouncheck Bouncheck (Author) commented Oct 2, 2025

Before merging this, let's evaluate #682.
I think switching from hardcoded sleeps to awaits and using more specific patterns are still worthwhile changes, but maybe I should not be increasing the reconnection tolerance here.

@dkropachev

@Bouncheck, please rebase it.

@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from fe07a8c to 53f16d9 on October 14, 2025 11:50
@Bouncheck Bouncheck (Author) commented Oct 14, 2025

Rebased and reworked this PR, and updated the PR description to reflect the current changes.
Issues outside of adjusting the test configuration were addressed in #737.
