Skip to content

Document read-after-write semantics for getRegister #131522

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

DaveCTurner
Copy link
Contributor

Clarifies in its documentation that BlobContainer#getRegister offers
only read-after-write semantics rather than full linearizability, and
adds comments to its callers justifying why this is still safe.

Clarifies in its documentation that `BlobContainer#getRegister` offers
only read-after-write semantics rather than full linearizability, and
adds comments to its callers justifying why this is still safe.
@DaveCTurner DaveCTurner requested review from bcully and ywangd July 18, 2025 13:11
@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.2.0 labels Jul 18, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jul 18, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Contributor

@bcully bcully left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 833 to 835
// Step 4: Read the current register value. Calling getRegister is safe here because all earlier uploads are complete at
// this point, our upload is not completing yet, and later uploads can only be completing if they have already aborted ours,
// so either this read is linearizable or its result does not matter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this. Would it be useful to be explicit about write failures? I wrote this down for myself when I was reading the surrounding code, which I haven't spent much time in:

getRegister only has read-after-write semantics, which means it can return a stale value if another write is in flight, or if the last write failed.
In either of these cases the stale value does not break the correctness of this compare and exchange operation:

  • If another write is in flight, then the writer must have first aborted this upload, and so this operation will fail in step 5.
  • If the previous write failed, then that writer will fail its compare and exchange operation and nothing will depend on it, so it is safe to overwrite here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the last write failed

I am not sure how this could leave a stale value for this writer to read? Seems a serious bug if a failed write still writes to the register? Maybe I am missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ I'm not sure I understand this either. If the previous write failed then it did not modify the contents of the register blob, so we're not really overwriting it.

Copy link
Contributor

@bcully bcully Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general there are various things that have to happen after the write is applied before the request completes successfully to the client. The write itself is applied atomically but the machinery that ensures that reads see it isn't atomic with the write, and failures on that path will cause the request to fail to the client. Failures there also mean that reads can spend an indefinite amount of time returning either the previous value or the one just written, even though the client saw a failure on it.

In fact I think if a client issues several different writes and they all fail, then until there is a successful write a reader may see any of the attempted writes or the previous successfully written value :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sorry, somehow I missed updates to this thread)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, ok, and this also applies to e.g. a network outage blocking the response from a write. I opened #132173 to document this.

until there is a successful write

This is semantically pretty tricky (and may not even apply to the network outage case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also how does this interact with the ListMultipartUploads API when the write is a MPU? If the write didn't completely update all the visible copies then I'd expect the MPU still appears in the MPU list. Moreover I'd expect that if we called AbortMultipartUpload on this MPU then that would also resolve the situation (we can't actually abort it so this should retry its completion and then return 404 NoSuchUpload).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think failed writes are basically writes in flight for the purpose of read-after-write semantics, but they remain "in flight" until the next successful write, instead of resolving when the client sees the result of the request.

I don't have any special knowledge of how AbortMultipartUpload works, but I'd agree that if it succeeds, that seems like it must resolve the state of the MPU (it shouldn't be visible after the abort succeeds to the client). Of course it could fail too, and then we have no additional information :)

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +360 to +364
* <p>
* This operation has read-after-write consistency with respect to writes performed using {@link #compareAndExchangeRegister} and
* {@link #compareAndSetRegister}, but does not guarantee full linearizability. In particular, a {@code getRegister} performed during
* one of these write operations may return either the old or the new value, and a caller may therefore observe the old value
* <i>after</i> observing the new value, as long as both such read operations take place before the write operation completes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this comment or part of it be added to S3BlobContainer instead? The default implementation usees compareAndExchangeRegister which is linearizable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the same concerns apply to the GCS and Azure implementations. We're documenting the abstract interface in this comment, the default implementation isn't important (as long as it conforms to the contract of the abstract interface, which it does since it has stronger semantics).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact we can remove that default implementation, it's not actually needed any more, see #131604.

Comment on lines 833 to 835
// Step 4: Read the current register value. Calling getRegister is safe here because all earlier uploads are complete at
// this point, our upload is not completing yet, and later uploads can only be completing if they have already aborted ours,
// so either this read is linearizable or its result does not matter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the last write failed

I am not sure how this could leave a stale value for this writer to read? Seems a serious bug if a failed write still writes to the register? Maybe I am missing something.

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jul 21, 2025
@elasticsearchmachine elasticsearchmachine merged commit 888e9a2 into elastic:main Jul 21, 2025
33 checks passed
@DaveCTurner DaveCTurner deleted the 2025/07/18/getRegister-read-after-write branch July 21, 2025 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants