Conversation

@emilk emilk commented Nov 22, 2025

Some random notes from a sleep-deprived discussion with @nikolausWest

May be useful as a basis for a huddle. Definitely not ready for review.

Comment on lines +73 to +74
# Why are _some_ functions prefixed with `get_` and not others?
# Is the logic "if it takes an argument"? Maybe this is Pythonic, IDK

Worth running a "pythonic review board" on the APIs when we're done defining the functional part. I'd at least volunteer @ntjohnson1 and @timsaucer for that.

return view.segment_table(join_meta=join_meta, join_key=join_key)

def manifest(self) -> Any:
# What is this? What is it for?

the dataset manifest, aka a table describing the content of a dataset where each row is a layer


I do think this is useful, especially for any users who leverage the layers heavily. The naming may need improvement though.


changed to return a datafusion.DataFrame for consistency in:

Comment on lines +265 to +266
# wanted: def add_segments([recording])
# requires default_s3_bucket somewhere

💯 and I really, really want it (e.g. for tests). But IMO this is new functionality, not updating/fixing existing APIs. Given that it's potentially a rabbit hole, I'd like to scope it out. (But again, I would love to have this prioritized.)


from the blueprint-related decision: we need something like this, but exposing it for data is not top-priority

Comment on lines +268 to +277
# We have a bunch of potentially slow stuff here: register, create_fts_index, …
# What is the API we want?
# One option is to return a `Job` object with
# .block(), .id, .cancel()
# catalog.register(…).block(timeout=60)
# catalog.register(…) # non-blocking
# job = catalog.register(…)
# while not job.is_done():
# …
# job.block_and_tqdm()
@abey79 abey79 Nov 24, 2025


We have a Tasks API already, which implements wait(timeout=...).

Support for tqdm (let's call it progress or something) would be a very nice addition. I'm fine adding a tqdm dependency for this sole purpose because:

  • tqdm is extremely widespread/standard
  • we're splitting the redap sdk from the logging sdk anyways (right, RIGHT?)

@abey79 abey79 Nov 24, 2025


Decision:

  • make .register() return a Tasks for consistency (see the sketch after this list)
  • we need to build a nice API at the edge with clear error reporting (importantly, OK/NOK on a per-segment basis), cancelling, progress bar, etc.
  • make sure we have trace ids, etc. to help support/debugging
  • the existing redap low-level API is OK, but needs sugar coating on the Python side
  • expose Tasks as Job to "hide" the underlying implementation details?
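
A minimal sketch of the decided shape, assuming register returns the existing Tasks handle (the URI is made up):

tasks = catalog.register("s3://bucket/recording.rrd")  # non-blocking; returns a Tasks handle
tasks.wait(timeout=60)                                 # Tasks already implements wait(timeout=...)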


After discussing with @ntjohnson1:

  • we already have a Task (singular) abstraction that I somehow overlooked this morning
  • a good API inspiration may come from concurrent.futures.as_completed, which we could add to Tasks

The latter would allow:

  • very easy user-provided progress-bar integration (see the sketch below)
  • getting errors on a per-task basis as they arise
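
A sketch of what that could enable, assuming a hypothetical Tasks.as_completed() modeled on concurrent.futures.as_completed and a hypothetical per-task error attribute:

from tqdm import tqdm

tasks = catalog.register_batch(uris)  # assumption: returns a Tasks handle
for task in tqdm(tasks.as_completed(), total=len(tasks)):  # hypothetical API
    if task.error is not None:  # hypothetical per-task error reporting
        print(f"registration failed: {task.error}")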


100% support this


Sounds good!


@ntjohnson1 currently, register returns the segment id of the segment that was just registered. This will be "lost" if we return Tasks instead.

Q: was this returned segment id useful? (My gut feeling is yes, but I'd love to have your take on this.)

Tasks should obviously have the ability to provide the segment ids of what was registered, but:

  • from an API design perspective, it's tricky because segment registration is a special case of a task
  • I'm not even sure how this information is made available by redap in the first place

Alternative: register_* calls do not return any segment id information. The escape hatch is to list segments before/after and diff (see the sketch below).

cc @nikolausWest
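
For illustration, that escape hatch could look like this (the segments() listing call and the .id field are assumptions):

before = {s.id for s in dataset.segments()}  # hypothetical listing API
catalog.register(recording_uri)              # blocking variant, per the current signature
after = {s.id for s in dataset.segments()}
new_segment_ids = after - before             # ids added by this registration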


arf... as @jleibs noted below, registration is about (segment, layer), not just segment. (I somehow keep forgetting about this.) So this pleads in favour of not attempting to return such information from register_* or Tasks. How bad does that sound from a user's perspective?

def register(self, recording_uri: str, *, recording_layer: str = "base", timeout_secs: int = 60) -> str:
return self._inner.register(recording_uri, recording_layer=recording_layer, timeout_secs=timeout_secs)

# Do we really want/need a layer per URI?

that is (for now) the very definition of what a layer is
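
For illustration, registering two layers (URIs are made up; as @jleibs notes below, several registrations can add to the same segment):

catalog.register("s3://bucket/rec/base.rrd", recording_layer="base")
catalog.register("s3://bucket/rec/annotations.rrd", recording_layer="annotations")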

Comment on lines +290 to +291
# I assume this registers everything in a s3 bucket?
# Why isn't this just the same call as `register`?
@abey79 abey79 Nov 24, 2025


The FS analogy would be:

  • register -> single file
  • register_batch -> list of files
  • register_prefix -> entire directory (somehow directories are called prefixes in object stores)

I would submit the idea of merging these two APIs into one (using kwargs shenanigans) to the "pythonic review board"
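
For reference, the FS analogy in code (URIs are made up):

catalog.register("s3://bucket/rec.rrd")                             # single file
catalog.register_batch(["s3://bucket/a.rrd", "s3://bucket/b.rrd"])  # list of files
catalog.register_prefix("s3://bucket/recordings/")                  # entire "directory" (prefix)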

@abey79 abey79 Nov 24, 2025


decision:

  • we want:
    • register 1 segment with a given layer name (optionally)
    • register N segments with a single layer name (optionally) (use multiple calls if you want multiple layer names)
    • register all rrds under a given prefix with a single layer name (optionally)
  • all of these return Tasks
  • "pythonic review board" to decide how far to merge register, register_batch and register_prefix (the 3 of them sound like too much) (@abey79 to make a proposal)
  • register -> register_segment/s for clarity (these register a segment/layer combo, not a segment)


"pythonic review board" to decide how far to merge register, register_batch and register_prefix (the 3 of them sound way too much)

@nikolausWest this feels like the opposite direction of the table API design where you pushed for 3 explicitly separate APIs (append/overwrite/upsert) instead of just having a single API and a write_mode parameter.

We need to fundamentally decide if we want a minimal number of APIs with complex parameters and behaviors, or a larger number of APIs with very specific clear functionality.

I am personally leaning toward the latter. e.g. in the table APIs a separate append_batches would make the docs and implementation of error handling more clear (none of this incompatible subsets of parameters business).


Discussing this with @ntjohnson1: having a single register call is OK because all 3 can be used to achieve the exact same outcome (at the cost of a loop or so), whereas append/overwrite/upsert achieve distinct tasks. So "there should be one-- and preferably only one --obvious way to do it" would apply here.


Came on here to agree with @abey79 's last statement. I don't think this is the same argument as having different append/overwrite/upsert, because these are exactly 1 concept with 3 different input types. Taking different input types and automatically knowing how to handle them is one thing Python should be good at.


Regarding three separate actions vs one action with multiple input types, I was thinking the same as @abey79 and @timsaucer.


I'm disallowing it. A list of layer names is allowed only when uri is a list as well, and the lengths must match (see the sketch below).
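
A minimal sketch of that rule, assuming a merged register signature (names and types are illustrative):

def register(self, uri: str | list[str], *, layer: str | list[str] = "base") -> Tasks:
    if isinstance(layer, list):
        # a list of layer names is only allowed when `uri` is also a list...
        if not isinstance(uri, list):
            raise TypeError("a list of layer names requires a list of URIs")
        # ...and the lengths must match
        if len(layer) != len(uri):
            raise ValueError("`uri` and `layer` must have the same length")
    ...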


I buy the argument for merging register and register_batch. Promoting a single element to a list is fairly common. In both of them the total number of sub-tasks in the output is known and deterministic based on the input.

register_prefix, on the other hand, still feels like a legitimate odd duck to me. It's not something where you can achieve the same thing with a loop. The user may not even have permissions for the s3 location they are trying to register. (For that matter, the API itself also requires broader server permissions -- list vs read.) The number of results is unknown when the user makes the call. Behind the scenes, the server also goes down a fairly distinct code path to scan the prefix first.


The user may not even have permissions for the s3 location they are trying to register

This seems more like a bug than a feature to me, but it does provide a weird edge case around the equivalence of these methods.

One thing I mentioned to Antoine, which might be captured elsewhere: if we make our Tasks/Jobs object nicer/more informative, that might help make it clearer when their prefix is not aligned with their expectations. However, if they just use boto3 and rglob the prefix themselves, the same footgun exists: it's easy to add a lot more data than you thought, so it's pretty hard to design that away even with a different method.


I share the opinion that register and register_batch are very similar, and register_prefix a bit less so. Seeing this being debated, I'm personally undecided about merging them all vs. keeping register + register_prefix.

If I had to make the call, I would go for register/register_prefix because, when in doubt, I pick explicit over full-on kwargs.

Comment on lines +317 to 319
# May be mistakenly read in the imperative ("please index these ranges")
# No better suggestion though.
def index_ranges(self, index: str | IndexColumnDescriptor) -> datafusion.DataFrame:

get_index_ranges might be a better name, yeah


This goes to the above discussion for get_ prefixes and the conflict between brevity and being explicit. I would not expect index_ranges to perform indexing.

@abey79 abey79 Nov 25, 2025


@timsaucer I'm unsure about which direction you are trying to push.

As per my newfound "rule" of "skip get_ when it looks like a @property (but cannot be one)", this should have get_ as it takes a mandatory argument.


renamed to get_index_ranges in:

base_tokenizer=base_tokenizer,
)

# Not part of MVP. "Experimental - talk to sales"

)

# Is this only secondary indices? Should the name reflect that?
def list_indexes(self) -> list:

decision:

  • create_secondary_index (kwargs to define which type)
  • list_secondary_indices
  • delete_secondary_indices


@ntjohnson1 proposal: "search index" (illustrated below)

  • create_search_index
    • or create_fts_search_index and create_vector_search_index
  • list_search_indices
  • delete_search_indices
  • search
    • or fts_search/vector_search (docstring: "requires relevant search index")
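
For illustration, the proposed surface (argument names are made up):

dataset.create_search_index("text_col", kind="fts")  # kwargs select the index type
dataset.list_search_indices()
dataset.delete_search_indices("text_col")
dataset.search("query")                              # docstring: "requires relevant search index"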


Fair and more descriptive. Let's go with "search index".


MVP of that change implemented in:

def search_vector(self, query: Any, column: Any, top_k: int) -> Any:
return self._inner.search_vector(query, column, top_k)

# Should also return Job

yes


...well. The gRPC API currently doesn't return tasks. I'm not sure how this is even done in cloud. I suggest we leave this aside for this project, as this feature is entirely orthogonal to everything else.

│ │ type: Utf8 ┆ type: i32 │ │
│ ╞════════════╪════════════╡ │
│ │ __entries ┆ 3 │ │
│ │ __entries ┆ 3 │ │ # TODO(emilk): Can we remove __entries?

the entries table as a dataframe is essentially disappearing from the API, so it's ok for it to remain low-level

Comment on lines +76 to +77
# Should row_id be part of this?
# An API like `dataset_view.select_row_ids(row_ids)` would be the fastest way to fetch those rows of data.

This hinges on exposing the ability to run dataframe queries with row_id as index, which we currently don't expose (at least on python side). TBD, but probably out of scope for the API project.
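
For illustration only, the shape proposed in the quoted comment (select_row_ids is not an existing API):

row_ids = [...]  # row ids obtained from an earlier query
df = dataset_view.select_row_ids(row_ids)  # hypothetical: fetch exactly those rows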

# datetime.datetime(2000, 1, 1, 0, 0, 1, microsecond=500),
# ],
# fill_latest_at=True,
# ).concat(dataset_view.reader(

This makes sense for merging 1 or 2 views but seems impractical as a replacement for using_index_values on 1000s of segments.

# Create a view with all partitions
# TODO(Niko): Future API idea: "Sample the data at 20 Hz starting at this time point with latest-at"

# This is a cumbersome API. Can we do better?

Do you find the version that directly takes a DataFrame to be cumbersome?


FYI, we discussed that at length with Niko, and concluded that:

  • the current proposal is not great
  • it's not immediately obvious how to make it better
  • a lot of this discussion hinges on future features (time resample helper, alternate interpolation functions, etc.)
  • we are unable to guess now how these future features will look (aka chances are we'd make the wrong choices)

=> we're OK to roll with the proposal in its current state


Regarding the cumbersome comment: as Emil mentioned, this was just some notes we wrote down between flights, not meant as a real full review on its own. On first read-through, the combination of timeline + using_index_values + latest-at didn't seem future-proof, and it felt like there might be a more composable way to do it. One perceived problem was that we're likely to want more kinds of sampling methods, which will force more kwargs. The other is that more fill methods will do the same. However, we weren't able to come up with something composable that was good (timeline, sampling, and fill methods belong together), and we don't have enough clarity on the grounded needs for the sampling and fill methods to propose something better.

That's why we concluded it's best to keep it as is and then evolve this API as needed.


# …
# job.block_and_tqdm()

# Call these `register_segment` and `register_segments` for clarity

The challenge here is it's not a segment. It's a segment-layer (which would benefit from a better name). Several registrations can all add to the same segment.

def blueprint_dataset_id(self) -> EntryId | None:
return self._inner.blueprint_dataset_id()
# Remove this blueprint stuff, and just have
def set_default_blueprint(self, blueprint: Blueprint | Path) -> None:

For the API, can we just say set_blueprint? Because we either have the default blueprint or the heuristic in the viewer, so this is the only configurable blueprint.


Comment on lines +73 to +74
# Why are _some_ functions prefixed with `get_` and not others?
# Is the logic "if it takes an argument"? Maybe this is Pythonic, IDK

I think in general Python prefers brevity and Rust does not. So we have a case here where I suspect we are mirroring the internal APIs. I think we should remove the get_ prefixes where possible.


The current plan is:

def datasets() -> list[DatasetEntry]: ...
def get_dataset(*, id, name) -> DatasetEntry: ...

This morning, I got a bit annoyed with the inconsistency. Why is there a get_ prefix in one and not in the other? Yet, I felt that get_datasets() is unnecessarily verbose, and dataset(id, name) is not really pythonic, leading to a bit of cognitive dissonance.

I poked around a bit and ended up making peace with the status quo. Rationale:

  • datasets() doesn't need an argument (it actually has one optional arg now to show the hidden entries, but that's niche), so it looks and feels like a property. It cannot be a property given that network calls are involved, but it is one in spirit.
  • get_dataset() requires an argument, so it is unambiguously an accessor, for which the get_ prefix is extremely common in Python. The core inspiration here being dict: my_dict.get("item")
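
The convention applied (the dataset name is made up):

all_datasets = catalog.datasets()            # property-like: no required argument, no get_ prefix
ds = catalog.get_dataset(name="my_dataset")  # accessor with a required argument, hence get_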




id: fixed_size_binary[16] not null
name: string not null
entry_kind: int32 not null
entry_kind: int32 not null # Why is this int32 and not an enum? Poor py-arrow support?

The underlying data type is int32. We can create an extension type and then also update our rendering system to use those extension types.
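
A minimal sketch of such an extension type in pyarrow (the extension name is made up):

import pyarrow as pa

class EntryKindType(pa.ExtensionType):
    # hypothetical extension type wrapping the int32 entry_kind column
    def __init__(self):
        super().__init__(pa.int32(), "rerun.catalog.entry_kind")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(EntryKindType())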


I agree, but imo out of scope for this project.

Comment on lines +26 to +27
│ METADATA: # TODO(emilk): can we skip this?
│ * version: 0.1.1 # TODO(emilk): can we skip this? Or call it `rerun_schema_version`?

Yes, we can update the rendering. Easily done. The question is: what about other metadata if the user adds it?


The main problem here is that version is unclear. What version is it referring to? Can the user add their own metadata here?

@abey79 abey79 Nov 25, 2025


From the code, it should be sorbet:version. I have no idea why we only see version here. Something about the formatting?

edit: confirmed, it's indeed sorbet:version in the data. I'm not sure why this gets printed as version for str(datafusion.DataFrame). Maybe a bug @timsaucer? Related to the handling of extension types? (These involve : as well iirc.)


Added notes to the issue in the datafusion-python repo, but here is the offending source:

let batch_opts = RecordBatchFormatOpts::default();
The default options are to trim metadata, which is applied both at the header and for the columns.


Ah, right! I've been completely fooled by this machinery I wasn't aware of. Sorry for the noise.
