
Conversation


@Bouncheck Bouncheck commented Oct 2, 2025

On recent Scylla versions this test started failing periodically.
It looks like with newer Scylla the driver somehow hits a scenario where
it successfully initializes a good portion of the connections, then
all connection attempts to one of the nodes get rejected.
It is accompanied by multiple errors like this:

```
19:38:41.582 [s0-admin-1] WARN  c.d.o.d.i.core.pool.ChannelPool - [s0|/127.0.2.2:19042]  Error while opening new channel
com.datastax.oss.driver.api.core.DriverTimeoutException: [s0|id: 0xfc42b7c7, L:/127.0.0.1:11854 - R:/127.0.2.2:19042] Protocol initialization request, step 1 (OPTIONS): timed out after 5000 ms
	at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.onTimeout(ChannelHandlerRequest.java:110)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:160)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
```

Increasing the delays between reconnections, or even increasing the test timeout
(the largest value tried was 40 seconds), does not help with this situation.
The node logs do not show anything suspicious, not even a WARN.

This change lowers the number of nodes to 1 (previously 2) and the number
of expected channels per session to 33 (previously 66) in resource-heavy
test methods. The number of sessions remains at 4.
The reconnection delays in `should_not_struggle_to_fill_pools` will now
start at around 300 ms and should not rise above 3200 ms.
This is the smallest tested set of changes that seems to resolve the issue.
The test remains meaningful since `should_struggle_to_fill_pools` still
shows considerably worse performance without advanced shard awareness.
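For reference, a minimal sketch of how such a bound on reconnection delays could be expressed with the driver's programmatic config (an assumption for illustration; the exact options the test sets are in the diff, not shown here):

```java
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import java.time.Duration;

class ReconnectionDelaySketch {
  // Hedged sketch: keep the exponential reconnection policy between ~300 ms and 3200 ms,
  // the range quoted in the description above.
  static DriverConfigLoader boundedReconnection() {
    return DriverConfigLoader.programmaticBuilder()
        .withDuration(DefaultDriverOption.RECONNECTION_BASE_DELAY, Duration.ofMillis(300))
        .withDuration(DefaultDriverOption.RECONNECTION_MAX_DELAY, Duration.ofMillis(3200))
        .build();
  }
}
```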

@Bouncheck Bouncheck self-assigned this Oct 2, 2025
@Bouncheck Bouncheck marked this pull request as draft October 2, 2025 02:10
@Bouncheck Bouncheck (Author) commented Oct 2, 2025

It seems that some logs leak between test methods; I have to fix that too.
Also, the channel pool can sometimes be null when the await's condition is checked.
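A minimal sketch of a null-safe await condition along these lines (the pool-size supplier is a placeholder, not the test's actual accessor):

```java
import static org.awaitility.Awaitility.await;

import java.time.Duration;
import java.util.function.Supplier;

class PoolAwaitSketch {
  // Hedged sketch: poll a pool-size supplier that may return null before the pool exists,
  // instead of dereferencing the pool directly in the condition.
  static void awaitChannels(Supplier<Integer> channelCount, int expected) {
    await()
        .atMost(Duration.ofSeconds(30))
        .until(() -> {
          Integer channels = channelCount.get(); // null while the pool is not created yet
          return channels != null && channels == expected;
        });
  }
}
```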

@Bouncheck Bouncheck (Author) commented Oct 2, 2025

Some of the reconnections come from RandomTokenVnodesIT, which should not be running at all: from what I can see it has no annotation for the Scylla backend and also carries a ScyllaSkip annotation.
This was incorrect; the reconnections are for the cluster created for the next test, which is DefaultMetadataTabletMapIT.
RandomTokenVnodesIT started taking a few seconds longer even though it is marked for skipping, so there was a slight change there too, but that does not make it the suspect here yet.

@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from 2c1d3b7 to e701f0d on October 2, 2025 13:01
@nikagra nikagra self-requested a review October 2, 2025 13:06
@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from e701f0d to 63e75ac on October 2, 2025 13:44
@Bouncheck Bouncheck changed the title from "Stabilize AdvancedShardAwarenessIT" to "4.x: Stabilize AdvancedShardAwarenessIT" Oct 2, 2025
@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from 63e75ac to fe07a8c on October 2, 2025 14:05
@nikagra nikagra requested a review from dkropachev October 2, 2025 14:29
@Bouncheck Bouncheck marked this pull request as ready for review October 2, 2025 14:32
@Bouncheck Bouncheck (Author) commented

Looks green now.
Created #678 to follow up on root cause.

```java
public void should_initialize_all_channels(boolean reuseAddress) {
  int poolLocalSizeSetting = 4; // Will round up to 6 due to not being divisible by 3 shards
  int expectedChannelsPerNode = 6;
  String node1 = CCM_RULE.getCcmBridge().getNodeIpAddress(1);
```
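As a quick illustration of the rounding mentioned in the comment above (assuming the configured local pool size is rounded up to the nearest multiple of the node's shard count):

```java
class PoolRoundingSketch {
  // Round the configured pool size up to the nearest multiple of the shard count.
  static int roundedPoolSize(int configured, int shards) {
    return ((configured + shards - 1) / shards) * shards;
  }

  public static void main(String[] args) {
    System.out.println(roundedPoolSize(4, 3)); // prints 6, matching expectedChannelsPerNode
  }
}
```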

This is even better than what we discussed: you do not force the test to use some predefined IP prefix, but make the actual IP part of the regex pattern.

@Bouncheck Bouncheck (Author) commented

Yes. A hardcoded IP would also work, but it could eventually collide with something.
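A minimal sketch of the idea discussed here, building the expected-log pattern from the node's actual IP (the surrounding log text is a placeholder, not the test's real pattern):

```java
import java.util.regex.Pattern;

class LogPatternSketch {
  // Embed the node's actual address in the regex instead of hardcoding an IP,
  // so the pattern cannot collide with addresses used by other tests.
  static Pattern connectionLogPattern(String nodeIp) {
    return Pattern.compile(".*Connecting to " + Pattern.quote(nodeIp) + ":\\d+.*");
  }
}
```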

```java
.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_LOW, 10000)
.withInt(DefaultDriverOption.ADVANCED_SHARD_AWARENESS_PORT_HIGH, 60000)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 64)
.withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, expectedChannelsPerNode)
```

I see you also reduced the number of channels to just 6. Is that intentional?

@Bouncheck Bouncheck (Author) commented Oct 2, 2025

Yes. In `should_initialize_all_channels` I made it 6 because it does not matter that much there; it's mainly a sanity check that the basic behaviour works.

Here (`should_see_mismatched_shard`) it is reduced to 33. That is less than 64 but still enough to be confident that, without advanced shard awareness, connections have a pretty high chance of landing on the wrong shard several times.
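Rough intuition for the choice of 33, assuming that without advanced shard awareness each new connection lands on a uniformly random shard of a 3-shard node:

```java
class ShardMissSketch {
  public static void main(String[] args) {
    int shards = 3;
    int channels = 33;
    // Expected number of connections that initially land on the wrong shard.
    double expectedMisses = channels * (1.0 - 1.0 / shards); // = 22.0
    System.out.println(expectedMisses);
  }
}
```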

@dkropachev

@Bouncheck, I have a couple of questions:

  1. Why and when exactly did it start failing? It was not like that two weeks ago.
  2. Why is it not failing on 6.1.5 but failing on other Scylla versions? What is the difference?

@Bouncheck Bouncheck (Author) commented Oct 2, 2025

> @Bouncheck, I have a couple of questions:
>
> 1. Why and when exactly did it start failing? It was not like that two weeks ago.
Why is currently unclear. The first sighting seems to be the GitHub Actions run after pushing this commit:
ab2665f
However, the final run visible under the PR does not show the same failures:
https://github.com/scylladb/java-driver/actions/runs/17986407382
I don't see anything significant in this commit that would explain the failures right now.

> 2. Why is it not failing on `6.1.5` but failing on other Scylla versions? What is the difference?

Also currently unclear. It could be something on the server side. One common thread I see between the failing runs is that there are sessions trying to communicate with the cluster created for DefaultMetadataTabletMapIT, which is long gone.
It could be a mix of something incorrect in the test code that surfaced due to some change on the server side.

Those extra sessions and reconnections cause additional matches to appear in the logs, but they are unrelated to the advanced shard awareness test. They could also make port collisions or timeouts slightly more likely.

@Bouncheck Bouncheck (Author) commented Oct 2, 2025

Before merging this, let's evaluate #682.
I think switching from hardcoded sleeps to awaits and using more specific patterns are still worthwhile changes, but maybe I should not be increasing the reconnection tolerance here.

@dkropachev

@Bouncheck, please rebase it.

@Bouncheck Bouncheck force-pushed the scylla-4.x-stabilize-adv-shard-awareness-IT branch from fe07a8c to 53f16d9 on October 14, 2025 11:50
@Bouncheck Bouncheck (Author) commented Oct 14, 2025

Rebased and reworked this PR, and updated the PR description to reflect the current changes.
Issues outside of adjusting the test configuration were addressed in #737.
