Document read-after-write semantics for `getRegister` #131522

DaveCTurner · 2025-07-18T13:11:07Z

Clarifies in its documentation that `BlobContainer#getRegister` offers only read-after-write semantics rather than full linearizability, and adds comments to its callers justifying why this is still safe.

elasticsearchmachine · 2025-07-18T13:11:32Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

bcully

LGTM

bcully · 2025-07-18T17:31:15Z

modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3BlobContainer.java

+                // Step 4: Read the current register value. Calling getRegister is safe here because all earlier uploads are complete at
+                // this point, our upload is not completing yet, and later uploads can only be completing if they have already aborted ours,
+                // so either this read is linearizable or its result does not matter.


I believe this. Would it be useful to be explicit about write failures? I wrote this down for myself when I was reading the surrounding code, which I haven't spent much time in:

getRegister only has read-after-write semantics, which means it can return a stale value if another write is in flight, or if the last write failed.
In either of these cases the stale value does not break the correctness of this compare and exchange operation:

If another write is in flight, then the writer must have first aborted this upload, and so this operation will fail in step 5.

If the previous write failed, then that writer will fail its compare and exchange operation and nothing will depend on it, so it is safe to overwrite here.
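The in-flight case described above can be sketched with a small toy model. It is a simplification under stated assumptions, not the real `S3BlobContainer` logic: `ToyUploadRegister`, `StaleReadDemo` and the upload ids `"A"` and `"B"` are hypothetical names invented for illustration. The only property it encodes is the one relied on in the comment: a competing writer aborts our upload before it starts completing its own, so a stale read on our side leads to a failed completion rather than a lost update.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical toy register updated via abortable uploads; not the real S3BlobContainer code. */
final class ToyUploadRegister {
    private long value;
    private final Set<String> openUploads = new HashSet<>();

    ToyUploadRegister(long initial) {
        this.value = initial;
    }

    /** Opens an upload that may later be completed or aborted. */
    void startUpload(String uploadId) {
        openUploads.add(uploadId);
    }

    /** Models a writer aborting every competing upload before completing its own. */
    void abortOthers(String uploadId) {
        openUploads.removeIf(other -> other.equals(uploadId) == false);
    }

    /** Returns the value of the last completed upload. */
    long read() {
        return value;
    }

    /** Applies the new value, or returns false if this upload was aborted by a competitor. */
    boolean complete(String uploadId, long newValue) {
        if (openUploads.remove(uploadId) == false) {
            return false;
        }
        value = newValue;
        return true;
    }
}

final class StaleReadDemo {
    static String run() {
        ToyUploadRegister register = new ToyUploadRegister(1L);
        register.startUpload("A");
        register.startUpload("B");
        register.abortOthers("B");              // B aborts A before it starts completing
        long witnessSeenByA = register.read();  // A reads while B is in flight and sees the old value 1
        boolean bCompleted = register.complete("B", 2L);
        // A acts on its stale witness, but its completion fails because its upload was aborted.
        boolean aCompleted = witnessSeenByA == 1L && register.complete("A", 3L);
        return "bCompleted=" + bCompleted + " aCompleted=" + aCompleted + " value=" + register.read();
    }
}
```

In this model `StaleReadDemo.run()` reports that B completed, A did not, and the register holds B's value, so A's stale read had no effect on the outcome.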

> if the last write failed

I am not sure how this could leave a stale value for this writer to read? Seems a serious bug if a failed write still writes to the register? Maybe I am missing something.

++ I'm not sure I understand this either. If the previous write failed then it did not modify the contents of the register blob, so we're not really overwriting it.

In general, various things have to happen after the write is applied before the request completes successfully to the client. The write itself is applied atomically, but the machinery that ensures that reads see it isn't atomic with the write, and failures on that path will cause the request to fail from the client's point of view. Failures there also mean that reads can spend an indefinite amount of time returning either the previous value or the one just written, even though the client saw the write fail.

In fact I think if a client issues several different writes and they all fail, then until there is a successful write a reader may see any of the attempted writes or the previous successfully written value :)
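A minimal way to picture this claim is to track the set of values a reader might observe, rather than a single value. The sketch below is a toy model only: `PossiblyVisibleValues` and its methods are hypothetical names, and it assumes (as the comment above suggests, without confirming it for every failure mode) that each failed write stays possibly visible until some later write succeeds.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy model of which values a reader might observe; not a real blob store API. */
final class PossiblyVisibleValues {
    private final Set<Long> visible = new TreeSet<>();

    PossiblyVisibleValues(long initial) {
        visible.add(initial);
    }

    /** A write that failed from the client's point of view may still have been applied, so its value stays possible. */
    void failedWrite(long attempted) {
        visible.add(attempted);
    }

    /** A successful write resolves the ambiguity: only its value remains possible. */
    void successfulWrite(long written) {
        visible.clear();
        visible.add(written);
    }

    /** Returns a sorted copy of the values a read might currently return. */
    Set<Long> possibleReads() {
        return new TreeSet<>(visible);
    }
}
```

Starting from value 1, two failed writes of 2 and 3 leave `possibleReads()` containing 1, 2 and 3; a subsequent successful write of 4 collapses it to just 4.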

(sorry, somehow I missed updates to this thread)

I see, ok, and this also applies to e.g. a network outage blocking the response from a write. I opened #132173 to document this.

> until there is a successful write

This is semantically pretty tricky (and may not even apply to the network outage case).

Also how does this interact with the ListMultipartUploads API when the write is a MPU? If the write didn't completely update all the visible copies then I'd expect the MPU still appears in the MPU list. Moreover I'd expect that if we called AbortMultipartUpload on this MPU then that would also resolve the situation (we can't actually abort it so this should retry its completion and then return 404 NoSuchUpload).

Overall I think failed writes are basically writes in flight for the purpose of read-after-write semantics, but they remain "in flight" until the next successful write, instead of resolving when the client sees the result of the request.

I don't have any special knowledge of how AbortMultipartUpload works, but I'd agree that if it succeeds, that seems like it must resolve the state of the MPU (it shouldn't be visible after the abort succeeds to the client). Of course it could fail too, and then we have no additional information :)

ywangd

LGTM

ywangd · 2025-07-21T00:20:53Z

server/src/main/java/org/elasticsearch/common/blobstore/BlobContainer.java

+     * <p>
+     * This operation has read-after-write consistency with respect to writes performed using {@link #compareAndExchangeRegister} and
+     * {@link #compareAndSetRegister}, but does not guarantee full linearizability. In particular, a {@code getRegister} performed during
+     * one of these write operations may return either the old or the new value, and a caller may therefore observe the old value
+     * <i>after</i> observing the new value, as long as both such read operations take place before the write operation completes.


Should this comment or part of it be added to S3BlobContainer instead? The default implementation uses compareAndExchangeRegister, which is linearizable?

I believe the same concerns apply to the GCS and Azure implementations. We're documenting the abstract interface in this comment, the default implementation isn't important (as long as it conforms to the contract of the abstract interface, which it does since it has stronger semantics).

In fact we can remove that default implementation, it's not actually needed any more, see #131604.
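The difference between read-after-write consistency and linearizability in the Javadoc wording above can be illustrated with a toy register. This is a sketch under stated assumptions, not the `BlobContainer` API: `ToyRegister`, `beginWrite`, `completeWrite` and the `servedByUpdatedCopy` flag are hypothetical, and the flag simply stands in for whichever copy of the data happens to serve a given read.

```java
import java.util.OptionalLong;

/** Hypothetical toy register with read-after-write semantics; not the Elasticsearch BlobContainer API. */
final class ToyRegister {
    private long committed;
    private OptionalLong inFlight = OptionalLong.empty();

    ToyRegister(long initial) {
        this.committed = initial;
    }

    /** Starts a write; until it completes, readers may see either the old or the new value. */
    void beginWrite(long newValue) {
        inFlight = OptionalLong.of(newValue);
    }

    /** Completes the in-flight write; every later read returns the new value. */
    void completeWrite() {
        committed = inFlight.getAsLong();
        inFlight = OptionalLong.empty();
    }

    /** The flag models which copy serves the read; it only matters while a write is in flight. */
    long read(boolean servedByUpdatedCopy) {
        return servedByUpdatedCopy && inFlight.isPresent() ? inFlight.getAsLong() : committed;
    }
}
```

With an initial value of 1 and a write of 2 in flight, `read(true)` followed by `read(false)` returns 2 and then 1, which is the "old value after the new value" behaviour that a linearizable register would forbid. Once `completeWrite()` has returned, both calls return 2.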

ywangd · 2025-07-21T00:28:12Z

modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3BlobContainer.java

+                // Step 4: Read the current register value. Calling getRegister is safe here because all earlier uploads are complete at
+                // this point, our upload is not completing yet, and later uploads can only be completing if they have already aborted ours,
+                // so either this read is linearizable or its result does not matter.


> if the last write failed

I am not sure how this could leave a stale value for this writer to read? Seems a serious bug if a failed write still writes to the register? Maybe I am missing something.

Document read-after-write semantics for getRegister

6d78a0b

Clarifies in its documentation that `BlobContainer#getRegister` offers only read-after-write semantics rather than full linearizability, and adds comments to its callers justifying why this is still safe.

DaveCTurner requested review from bcully and ywangd July 18, 2025 13:11

DaveCTurner added the >non-issue, :Distributed Coordination/Snapshot/Restore, and v9.2.0 labels Jul 18, 2025

elasticsearchmachine added the Team:Distributed Coordination label Jul 18, 2025

bcully reviewed Jul 18, 2025

ywangd approved these changes Jul 21, 2025

DaveCTurner added 2 commits July 21, 2025 08:03

Merge branch 'main' into 2025/07/18/getRegister-read-after-write

d988686

Minor wording improvements

6aa6f69

DaveCTurner added the auto-merge-without-approval label Jul 21, 2025

DaveCTurner added 2 commits July 21, 2025 09:22

Merge branch 'main' into 2025/07/18/getRegister-read-after-write

08419db

More wording improvements

d55c70e

elasticsearchmachine merged commit 888e9a2 into elastic:main Jul 21, 2025
33 checks passed

DaveCTurner deleted the 2025/07/18/getRegister-read-after-write branch July 21, 2025 10:04
