Add food-for-thought around new query API #11949
Conversation
| # Why are _some_ functions prefixed with `get_` and not others? | ||
| # Is the logic "if it takes an argument"? Maybe this is Pythonic, IDK |
Worth running a "pythonic review board" on the APIs when we're done with defining the functional part. I'd at least volunteer @ntjohnson1 and @timsaucer for that.
| return view.segment_table(join_meta=join_meta, join_key=join_key) | ||
|
|
||
| def manifest(self) -> Any: | ||
| # What is this? What is it for? |
the dataset manifest, aka a table describing the content of a dataset where each row is a layer
I do think this is useful, especially for any users who leverage the layers heavily. The naming may need improvement though.
changed to return a datafusion.DataFrame for consistency in:
| # wanted: def add_segments([recording]) | ||
| # requires default_s3_bucket somewhere |
💯 and I really really want it (e.g. for tests, etc.). But imo this is new functionality, not updating/fixing of existing APIs. Given that it's potentially a rabbit hole, I'd like to scope it out. (But again, I would love to have this prioritized.)
from the blueprint-related decision: we need something like this, but exposing it for data is not top-priority
| # We have a bunch of potentially slow stuff here: register, create_fts_index, … | ||
| # What is the API we want? | ||
| # One option is to return a `Job` object with | ||
| # .block(), .id, .cancel() | ||
| # catalog.register(…).block(timeout=60) | ||
| # catalog.register(…) # non-blocking | ||
| # job = catalog.register(…) | ||
| # while not job.is_done(): | ||
| # … | ||
| # job.block_and_tqdm() |
We have a Tasks api already, which implements wait(timeout=...).
Support for tqdm (let's call it `progress` or something) would be a very nice addition. I'm fine adding a tqdm dependency for this sole purpose because:
- tqdm is extremely widespread/standard
- we're splitting the redap sdk from the logging sdk anyways (right, RIGHT?)
Decision:
- make `.register()` return a `Tasks` for consistency
- we need to build a nice API at the edge with clear error reporting (importantly OK/NOK on a per-segment basis), cancelling, progress bar, etc.
- make sure we have trace ids, etc. to help support/debugging
- the existing redap low-level API is ok, but needs sugar coating on the Python side.
- expose `Tasks` as `Job` to "hide" the underlying implementation details?
After discussing with @ntjohnson1:
- we already have a `Task` (singular) abstraction that I somehow overlooked this morning
- a good API inspiration may come from `concurrent.futures.as_completed`, which we could add to `Tasks`

The latter would allow:
- very easy user-provided progress-bar integration
- getting errors on a per-task basis as they arise
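To make the `as_completed` idea concrete, here is a sketch of the pattern using the stdlib `concurrent.futures` directly rather than the actual `Tasks` class (`run_all` and its shape are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(jobs: dict) -> tuple[dict, dict]:
    """Run named callables concurrently; collect results and errors per task.

    Because `as_completed` yields futures as they finish, a progress bar
    (tqdm or user-provided) can simply wrap this loop.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(fn): name for name, fn in jobs.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:  # errors surface per task, not for the whole batch
                errors[name] = exc
    return results, errors
```

A `Tasks.as_completed()` with this shape would give per-task error reporting and trivial progress integration for free.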
100% support this
Sounds good!
@ntjohnson1 currently, `register` returns the segment id of the segment that was just registered. This will be "lost" if we return `Tasks` instead.
Q: was this returned segment id useful? (My gut feeling is yes, but I'd love to have your take on this.)
Tasks should obviously have the ability to provide the segment ids of what was registered, but:
- from an API design perspective, it's tricky because segment registration is a special case of a task
- I'm not even sure how this information is made available by redap in the first place
Alternative: register_* calls do not return any segment ids information. Escape hatch is to list segments before/after and diff.
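The list-before/after escape hatch mentioned above amounts to a set difference; a trivial sketch (the `segment_ids` listings here are stand-ins for whatever the actual listing call returns):

```python
def newly_registered(before: list[str], after: list[str]) -> set[str]:
    """Escape hatch: diff segment-id listings taken before and after registration."""
    return set(after) - set(before)
```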
arf... as @jleibs noted below, registration is about (segment, layer), not just segment. (I somehow keep forgetting about this.) So this pleads in favour of not attempting to return such information from register_* or Tasks. How bad does that sound from a user's perspective?
| def register(self, recording_uri: str, *, recording_layer: str = "base", timeout_secs: int = 60) -> str: | ||
| return self._inner.register(recording_uri, recording_layer=recording_layer, timeout_secs=timeout_secs) | ||
|
|
||
| # Do we really want/need a layer per URI? |
that is (for now) the very definition of what a layer is
| # I assume this registers everything in a s3 bucket? | ||
| # Why isn't this just the same call as `register`? |
The FS analogy would be:
- `register` -> single file
- `register_batch` -> list of files
- `register_prefix` -> entire directory (somehow directories are called "prefixes" in object stores)

I would submit the idea of merging these two APIs into one (using kwargs shenanigans) to the "pythonic review board"
decision:
- we want:
  - register 1 segment with a given layer name (optionally)
  - register N segments with a single layer name (optionally) (use multiple calls if you want multiple layer names)
  - register all rrds under a given prefix with a single layer name (optionally)
- all of these return `Tasks`
- "pythonic review board" to decide how far to merge `register`, `register_batch` and `register_prefix` (the 3 of them sound way too much) (@abey79 to make a proposal) (these register a segment/layer combo, not a segment)
- `register` -> `register_segment/s` for clarity
"pythonic review board" to decide how far to merge register, register_batch and register_prefix (the 3 of them sound way too much)
@nikolausWest this feels like the opposite direction of the table API design where you pushed for 3 explicitly separate APIs (append/overwrite/upsert) instead of just having a single API and a write_mode parameter.
We need to fundamentally decide if we want a minimal number of APIs with complex parameters and behaviors, or a larger number of APIs with very specific clear functionality.
I am personally leaning toward the latter. e.g. in the table APIs a separate append_batches would make the docs and implementation of error handling more clear (none of this incompatible subsets of parameters business).
Discussing this with @ntjohnson1, having a single register call is ok because all 3 can be used to achieve the exact same outcome (at the cost of a loop or so), whereas append/overwrite/upsert achieve distinct tasks. So "there should be one-- and preferably only one --obvious way to do it" would apply here.
Came on here to agree with @abey79 's last statement. I don't think this is the same argument as having different append/overwrite/upsert because these are exactly 1 concept with 3 different input types. Having different input types and automatically knowing how to handle them is one thing python should be good at.
Regarding three separate actions vs one action with multiple input types I was thinking the same as @abey79 and @timsaucer
I'm disallowing it. A list of layer names is allowed only when `uri` is a list as well, and the lengths must match.
I buy the argument for merging register and register batch. Promoting a single element to a list is fairly common. In both of them the total number of sub-tasks in the output is known and deterministic based on the input.
register_prefix, on the other hand, still feels like a legitimate odd duck to me. It's actually not something where you can achieve the same thing with a loop. The user may not even have permissions for the s3 location they are trying to register. (For that matter, the API itself also requires broader server permissions -- list vs read.) The number of results is unknown when the user makes the call. The server behind the scenes also goes down a fairly distinct code path to scan the prefix first.
The user may not even have permissions for the s3 location they are trying to register
This seems more like a bug than a feature to me, but it does provide a weird edge case around the equivalence of these methods.
One thing that I mentioned to Antoine that might be captured elsewhere: if we make our Tasks/Jobs object nicer/more informative, that might help make it clearer when their prefix is not aligned with their expectations. However, if they just use boto3 and rglob the prefix, the same footgun applies: it's pretty easy to add a lot more data than you thought, so this is pretty hard to design away even with a different method.
I share the opinion that register and register_batch are very similar, and register_prefix a bit less so. Seeing this being debated, I'm personally undecided about merging them all vs. keeping register + register_prefix.
If I had to make the call, I would go for register/register_prefix because, when in doubt, I pick explicit over full-on kwargs.
| # May be mistakenly read in the imperative ("please index these ranges") | ||
| # No better suggestion though. | ||
| def index_ranges(self, index: str | IndexColumnDescriptor) -> datafusion.DataFrame: |
get_index_ranges might be a better name, yeah
This goes to the above discussion for get_ prefixes and the conflict between brevity and being explicit. I would not expect index_ranges to perform indexing.
@timsaucer I'm unsure about which direction you are trying to push.
As per my newfound "rule" of "skip get_ when it looks like a @property (but cannot be one)", this should have get_ as it takes a mandatory argument.
renamed to `get_index_ranges` in:
| base_tokenizer=base_tokenizer, | ||
| ) | ||
|
|
||
| # Not part of MVP. "Experimental - talk to sales" |
| ) | ||
|
|
||
| # Is this only secondary indices? Should the name reflect that? | ||
| def list_indexes(self) -> list: |
decision:
- `create_secondary_index` (kwargs to define which type)
- `list_secondary_indices`
- `delete_secondary_indices`
@ntjohnson1 proposal: "search index"
- `create_search_index`
  - or `create_fts_search_index` and `create_vector_search_index`
- `list_search_indices`
- `delete_search_indices`
- `search`
  - or `fts_search` / `vector_search` (docstring: "requires relevant search index")
Fair and more descriptive. Let's go with "search index".
mvp of that change implemented in:
| def search_vector(self, query: Any, column: Any, top_k: int) -> Any: | ||
| return self._inner.search_vector(query, column, top_k) | ||
|
|
||
| # Should also return Job |
yes
...well. The grpc currently doesn't return tasks. I'm not sure how this is even done in cloud. I suggest we leave this aside for this project, as this feature is entirely orthogonal to everything else.
| │ │ type: Utf8 ┆ type: i32 │ │ | ||
| │ ╞════════════╪════════════╡ │ | ||
| │ │ __entries ┆ 3 │ │ | ||
| │ │ __entries ┆ 3 │ │ # TODO(emilk): Can we remove __entries? |
the entries table as a dataframe is essentially disappearing from the API, so it's ok for it to remain low-level
| # Should row_id be part of this? | ||
| # An API like `dataset_view.select_row_ids(row_ids)` would be the fastest way to fetch those rows of data. |
This hinges on exposing the ability to run dataframe queries with row_id as index, which we currently don't expose (at least on python side). TBD, but probably out of scope for the API project.
| # datetime.datetime(2000, 1, 1, 0, 0, 1, microsecond=500), | ||
| # ], | ||
| # fill_latest_at=True, | ||
| # ).concat(dataset_view.reader( |
This makes sense for merging 1 or 2 views but seems impractical as a replacement for using_index_values on 1000s of segments.
| # Create a view with all partitions | ||
| # TODO(Niko): Future API idea: "Sample the data at 20 Hz starting at this time point with latest-at" | ||
|
|
||
| # This is a cumbersome API. Can we do better? |
Do you find the version that directly takes a DataFrame to be cumbersome?
FYI, we discussed that at length with Niko, and concluded that:
- the current proposal is not great
- it's not immediately obvious how to make it better
- a lot of this discussion hinges on future features (time resample helper, alternate interpolation function, etc.)
- we are unable to make a guess now as to how these future features will look (aka chances are we'd make wrong choices)

=> we're OK to roll with the proposal in its current state
Regarding the cumbersome comment: as Emil mentioned, this was just some notes we wrote down between flights and not meant as a real full review on its own. On first read-through, the combination of timeline + using_index_values + latest-at seemed non-future-proof, and like there might be a more composable way to do it. One of the perceived problems was that we're likely to want more kinds of sampling methods, which will force more kwargs. The other is that more fill methods will do the same. However, we weren't able to come up with something composable that was good (timeline, sampling, and fill methods belong together), and we don't have enough clarity on the grounded needs for the sampling and fill methods to propose something better.
That's why we concluded it's best to keep it as is and then evolve this API as needed.
| # … | ||
| # job.block_and_tqdm() | ||
|
|
||
| # Call these `register_segment` and `register_segments` for clarity |
The challenge here is it's not a segment. It's a segment-layer (which would benefit from a better name). Several registrations can all add to the same segment.
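A toy model of that point, sketching the (segment, layer) pair as the registration target rather than the segment itself (class and field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SegmentLayer:
    """Hypothetical registration key: several registrations may add
    different layers to the same segment."""
    segment_id: str
    layer: str

# two registrations contributing to the same segment, different layers
base = SegmentLayer("seg-1", "base")
lidar = SegmentLayer("seg-1", "lidar")
```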
| def blueprint_dataset_id(self) -> EntryId | None: | ||
| return self._inner.blueprint_dataset_id() | ||
| # Remove these blueprint stuff, and just have | ||
| def set_default_blueprint(self, blueprint: Blueprint | Path) -> None: |
For the API, can we just say `set_blueprint`? We either have the default blueprint or the viewer's heuristic, so this is the only configurable blueprint.
mvp implemented in:
| # Why are _some_ functions prefixed with `get_` and not others? | ||
| # Is the logic "if it takes an argument"? Maybe this is Pythonic, IDK |
I think in general Python prefers brevity and Rust does not. So we have a case here where I suspect we are mirroring the internal APIs. I think we should remove the get_ prefixes where possible.
The current plan is:

```python
def datasets() -> list[DatasetEntry]: ...
def get_dataset(*, id, name) -> DatasetEntry: ...
```

This morning, I got a bit annoyed with the inconsistency. Why is there a `get_` prefix in one and not in the other? Yet, I felt that `get_datasets()` is unnecessarily verbose, and `dataset(id, name)` is not really pythonic, leading to a bit of cognitive dissonance.
I poked around a bit and ended up coming to peace with the status quo. Rationale:
- `datasets()` doesn't need an argument (it actually has one optional arg now to show the hidden entries, but that's niche), so it looks and feels like a property. It cannot be a property given that network calls are involved, but it is one in spirit.
- `get_dataset()` requires an argument, so it is unambiguously an accessor, for which the `get_` prefix is extremely common in Python. The core inspiration here is `dict`: `my_dict.get("item")`
| id: fixed_size_binary[16] not null | ||
| name: string not null | ||
| entry_kind: int32 not null | ||
| entry_kind: int32 not null # Why is this int32 and not an enum? Poor py-arrow support? |
The underlying data type is int32. We can create an extension type and then also update our rendering system to use those extension types.
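Short of a proper Arrow extension type, a Python-side mapping from raw int32 codes to readable names could look like this. Note that the enum members and numeric codes below are placeholders for illustration, not Rerun's actual entry kinds:

```python
from enum import IntEnum

class EntryKind(IntEnum):
    # NOTE: placeholder codes, not the actual wire values.
    DATASET = 1
    TABLE = 2
    BLUEPRINT = 3

def render_entry_kind(code: int) -> str:
    """Render a raw int32 entry_kind column value as a readable name."""
    try:
        return EntryKind(code).name.lower()
    except ValueError:
        return f"unknown({code})"
```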
I agree, but imo out of scope for this project.
| │ METADATA: # TODO(emilk): can we skip this? │ | ||
| │ * version: 0.1.1 # TODO(emilk): can we skip this? Or call it `rerun_schema_version`? │ |
Yes, we can update the rendering. Easily done. The question is: what about other metadata if the user adds it?
The main problem here is that version is unclear. What version is it referring to? Can the user add their own metadata here?
From the code, it should be `sorbet:version`. I have no idea why we only see `version` here. Something about the formatting?
edit: confirmed, it's indeed `sorbet:version` in the data. I'm not sure why this gets printed as `version` for `str(datafusion.DataFrame)`. Maybe a bug @timsaucer? Related to the handling of extension types? (These involve `:` as well iirc.)
Added notes to the issue in the datafusion-python repo, but here is the offending source:
| let batch_opts = RecordBatchFormatOpts::default(); |
Ah, right! I've been completely fooled by this machinery I wasn't aware of. Sorry for the noise.
Some random notes from a sleep-deprived discussion with @nikolausWest
May be useful as a basis for a huddle. Definitely not ready for review.