Document read-after-write semantics for getRegister
#131522
Conversation
Clarifies in the documentation for `BlobContainer#getRegister` that it offers only read-after-write semantics rather than full linearizability, and adds comments to its callers justifying why this is still safe.
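For orientation, here is a minimal sketch of the two register operations under discussion. The method shapes are simplified assumptions, not the actual interface: the real `BlobContainer` methods are asynchronous (`ActionListener`-based) and take extra parameters.

```java
// Simplified, hypothetical sketch of the register operations discussed in this
// PR; the real BlobContainer interface is asynchronous and richer than this.
interface RegisterOps {
    // Linearizable: atomically replaces the register's contents with `updated`
    // iff they currently equal `expected`, and returns the value it witnessed.
    byte[] compareAndExchangeRegister(String key, byte[] expected, byte[] updated);

    // Read-after-write only: returns the most recently written value once no
    // write is in flight, but while a write is ongoing it may return either
    // the old or the new value, possibly in either order across two reads.
    byte[] getRegister(String key);
}
```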
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
LGTM
// Step 4: Read the current register value. Calling getRegister is safe here because all earlier uploads are complete at
// this point, our upload is not completing yet, and later uploads can only be completing if they have already aborted ours,
// so either this read is linearizable or its result does not matter.
I believe this. Would it be useful to be explicit about write failures? I wrote this down for myself when I was reading the surrounding code, which I haven't spent much time in:
getRegister only has read-after-write semantics, which means it can return a stale value if another write is in flight, or if the last write failed.
In either of these cases the stale value does not break the correctness of this compare-and-exchange operation (a sketch of both cases follows after this list):
- If another write is in flight, then the writer must have first aborted this upload, and so this operation will fail in step 5.
- If the previous write failed, then that writer will fail its compare and exchange operation and nothing will depend on it, so it is safe to overwrite here.
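A hedged sketch of these two cases, loosely following the numbered step comments quoted above. Every helper here is a hypothetical stand-in for the real S3 multipart-upload machinery, not actual Elasticsearch code:

```java
import java.util.Arrays;

// Illustrative sketch only: the structure loosely follows the numbered step
// comments in the diff above; all abstract helpers are hypothetical stand-ins
// for the real S3 multipart-upload machinery.
abstract class CasViaMultipartUploadSketch {
    abstract String startMultipartUpload();
    abstract void abortOtherUploads(String ourUploadId);
    abstract void abortMultipartUpload(String uploadId);
    // Throws if our upload has been aborted by a competing writer.
    abstract void completeMultipartUpload(String uploadId, byte[] value);
    abstract byte[] getRegister();

    byte[] compareAndExchange(byte[] expected, byte[] updated) {
        String uploadId = startMultipartUpload();
        abortOtherUploads(uploadId);        // competing writers must abort us first
        byte[] witnessed = getRegister();   // step 4: see the comment quoted above
        // Case 1: another write is now in flight => it already aborted our
        // upload, so completeMultipartUpload below throws and this CAS fails
        // (step 5).
        // Case 2: the previous write failed => its CAS failed and nothing
        // depends on its value, so overwriting whatever we read here is safe.
        if (Arrays.equals(witnessed, expected) == false) {
            abortMultipartUpload(uploadId);
            return witnessed;
        }
        completeMultipartUpload(uploadId, updated);
        return witnessed;
    }
}
```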
if the last write failed
I am not sure how this could leave a stale value for this writer to read? Seems a serious bug if a failed write still writes to the register? Maybe I am missing something.
++ I'm not sure I understand this either. If the previous write failed then it did not modify the contents of the register blob, so we're not really overwriting it.
In general there are various things that have to happen after the write is applied before the request completes successfully to the client. The write itself is applied atomically but the machinery that ensures that reads see it isn't atomic with the write, and failures on that path will cause the request to fail to the client. Failures there also mean that reads can spend an indefinite amount of time returning either the previous value or the one just written, even though the client saw a failure on it.
In fact I think if a client issues several different writes and they all fail, then until there is a successful write a reader may see any of the attempted writes or the previous successfully written value :)
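To make the anomaly concrete, here is a hypothetical timeline based on the behaviour described above (the values A, B, C are purely illustrative):

```java
// Hypothetical timeline of the behaviour described above.
//
//   register contains A          (last successful write)
//   writer: write(B)          -> request FAILS back to the client
//   reader: getRegister()     -> B   (the failed write is visible...)
//   reader: getRegister()     -> A   (...and then it is not; this can
//                                     flip-flop for an indefinite time)
//   writer: write(C)          -> request succeeds
//   reader: getRegister()     -> C   (only a successful write settles it)
```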
(sorry, somehow I missed updates to this thread)
I see, ok, and this also applies to e.g. a network outage blocking the response from a write. I opened #132173 to document this.
until there is a successful write
This is semantically pretty tricky (and may not even apply to the network outage case).
Also how does this interact with the `ListMultipartUploads` API when the write is an MPU? If the write didn't completely update all the visible copies then I'd expect the MPU still appears in the MPU list. Moreover I'd expect that if we called `AbortMultipartUpload` on this MPU then that would also resolve the situation (we can't actually abort it so this should retry its completion and then return 404 `NoSuchUpload`).
Overall I think failed writes are basically writes in flight for the purpose of read-after-write semantics, but they remain "in flight" until the next successful write, instead of resolving when the client sees the result of the request.
I don't have any special knowledge of how `AbortMultipartUpload` works, but I'd agree that if it succeeds, that seems like it must resolve the state of the MPU (it shouldn't be visible after the abort succeeds to the client). Of course it could fail too, and then we have no additional information :)
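For reference, a hedged sketch of the clean-up idea discussed in this thread, using the AWS SDK for Java v2. The bucket and key names are placeholders, error handling is elided, and this is not the actual Elasticsearch implementation:

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.AbortMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.ListMultipartUploadsRequest;
import software.amazon.awssdk.services.s3.model.ListMultipartUploadsResponse;
import software.amazon.awssdk.services.s3.model.MultipartUpload;
import software.amazon.awssdk.services.s3.model.NoSuchUploadException;

class MpuCleanupSketch {
    // Lists in-flight MPUs for the register blob and aborts each one. Per the
    // discussion above, a successful abort (or a 404 NoSuchUpload, meaning the
    // upload has since been resolved) tells us that MPU can no longer affect
    // the register's contents.
    static void abortStaleUploads(S3Client s3, String bucket, String registerKey) {
        ListMultipartUploadsResponse listing = s3.listMultipartUploads(
            ListMultipartUploadsRequest.builder().bucket(bucket).prefix(registerKey).build());
        for (MultipartUpload upload : listing.uploads()) {
            try {
                s3.abortMultipartUpload(AbortMultipartUploadRequest.builder()
                    .bucket(bucket)
                    .key(upload.key())
                    .uploadId(upload.uploadId())
                    .build());
            } catch (NoSuchUploadException e) {
                // Already completed or aborted; its effect on the register is settled.
            }
        }
    }
}
```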
LGTM
* <p>
* This operation has read-after-write consistency with respect to writes performed using {@link #compareAndExchangeRegister} and
* {@link #compareAndSetRegister}, but does not guarantee full linearizability. In particular, a {@code getRegister} performed during
* one of these write operations may return either the old or the new value, and a caller may therefore observe the old value
* <i>after</i> observing the new value, as long as both such read operations take place before the write operation completes.
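In caller terms, the contract above means a plain `getRegister` result must be treated as possibly stale. A hedged sketch, reusing the hypothetical `RegisterOps` interface from the sketch near the top of this page (not the real asynchronous `BlobContainer` API):

```java
// Sketch only: RegisterOps is the simplified, hypothetical interface from the
// earlier sketch, not the real ActionListener-based BlobContainer API.
class RegisterReadSketch {
    static byte[] readFresh(RegisterOps registers, String key) {
        // May be stale: a concurrent (or recently failed) write means this can
        // return an old value, even after a newer value was already observed.
        byte[] hint = registers.getRegister(key);

        // For a guaranteed-fresh value, fall back to the linearizable
        // primitive: a compare-and-exchange whose expected and updated values
        // are equal witnesses the current contents without changing them.
        return registers.compareAndExchangeRegister(key, hint, hint);
    }
}
```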
Should this comment, or part of it, be added to `S3BlobContainer` instead? The default implementation uses `compareAndExchangeRegister`, which is linearizable?
I believe the same concerns apply to the GCS and Azure implementations. We're documenting the abstract interface in this comment; the default implementation isn't important (as long as it conforms to the contract of the abstract interface, which it does since it has stronger semantics).
In fact we can remove that default implementation, it's not actually needed any more, see #131604.