Conversation

@emilk emilk commented Nov 22, 2025

Some random notes from a sleep-deprived discussion with @nikolausWest

May be useful as a basis for a huddle. Definitely not ready for review.

Comment on lines +73 to +74
# Why are _some_ functions prefixed with `get_` and not others?
# Is the logic "if it takes an argument"? Maybe this is Pythonic, IDK

Worth running a "pythonic review board" on the APIs when we're done defining the functional part. I'd at least volunteer @ntjohnson1 and @timsaucer for that.

return view.segment_table(join_meta=join_meta, join_key=join_key)

def manifest(self) -> Any:
# What is this? What is it for?

the dataset manifest, aka a table describing the content of a dataset where each row is a layer


I do think this is useful, especially for any users who leverage the layers heavily. The naming may need improvement though.


changed to return a datafusion.DataFrame for consistency in:

Comment on lines +265 to +266
# wanted: def add_segments([recording])
# requires default_s3_bucket somewhere

💯 and I really, really want it (e.g. for tests). But IMO this is new functionality, not updating/fixing existing APIs. Given that it's potentially a rabbit hole, I'd like to scope it out. (But again, I would love to have this prioritized.)


from the blueprint-related decision: we need something like this, but exposing it for data is not top-priority

Comment on lines +268 to +277
# We have a bunch of potentially slow stuff here: register, create_fts_index, …
# What is the API we want?
# One option is to return a `Job` object with
# .block(), .id, .cancel()
# catalog.register(…).block(timeout=60)
# catalog.register(…) # non-blocking
# job = catalog.register(…)
# while not job.is_done():
# …
# job.block_and_tqdm()
@abey79 abey79 Nov 24, 2025


We have a Tasks API already, which implements wait(timeout=...).

Support for tqdm (let's call it progress or something) would be a very nice addition. I'm fine adding a tqdm dependency for this sole purpose because:

  • tqdm is extremely widespread/standard
  • we're splitting the redap sdk from the logging sdk anyways (right, RIGHT?)

@abey79 abey79 Nov 24, 2025


Decision:

  • make .register() return a Tasks for consistency (see the sketch after this list)
  • we need to build a nice API at the edge with clear error reporting (importantly, OK/NOK on a per-segment basis), cancelling, progress bar, etc.
  • make sure we have trace ids, etc. to help support/debugging
  • the existing redap low-level API is OK, but needs sugar coating on the Python side
  • expose Tasks as Job to "hide" the underlying implementation details?
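
A minimal sketch of the decided shape, assuming register returns the existing Tasks handle (the URI is made up):

tasks = catalog.register("s3://bucket/recording.rrd")  # non-blocking; returns a Tasks handle
tasks.wait(timeout=60)                                 # Tasks already implements wait(timeout=...)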


After discussing with @ntjohnson1:

  • we already have a Task (singular) abstraction that I somehow overlooked this morning
  • a good API inspiration may come from concurrent.futures.as_completed, which we could add to Tasks

The latter would allow:

  • very easy user-provided progress-bar integration (see the sketch below)
  • getting errors on a per-task basis as they arise
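
A sketch of what that could enable, assuming a hypothetical Tasks.as_completed() modeled on concurrent.futures.as_completed and a hypothetical per-task error attribute:

from tqdm import tqdm

tasks = catalog.register_batch(uris)  # assumption: returns a Tasks handle
for task in tqdm(tasks.as_completed(), total=len(tasks)):  # hypothetical API
    if task.error is not None:  # hypothetical per-task error reporting
        print(f"registration failed: {task.error}")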


100% support this


Sounds good!


@ntjohnson1 currently, register returns the segment id of the segment that was just registered. This will be "lost" if we return Tasks instead.

Q: was this returned segment id useful? (My gut feeling is yes, but I'd love to have your take on this.)

Tasks should obviously have the ability to provide the segment ids of what was registered, but:

  • from an API design perspective, it's tricky because segment registration is a special case of a task
  • I'm not even sure how this information is made available by redap in the first place

Alternative: register_* calls do not return any segment id information. The escape hatch is to list segments before/after and diff (see the sketch below).

cc @nikolausWest
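
For illustration, that escape hatch could look like this (the segments() listing call and the .id field are assumptions):

before = {s.id for s in dataset.segments()}  # hypothetical listing API
catalog.register(recording_uri)              # blocking variant, per the current signature
after = {s.id for s in dataset.segments()}
new_segment_ids = after - before             # ids added by this registration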


arf... as @jleibs noted below, registration is about (segment, layer), not just segment. (I somehow keep forgetting about this.) So this pleads in favour of not attempting to return such information from register_* or Tasks. How bad does that sound from a user's perspective?

def register(self, recording_uri: str, *, recording_layer: str = "base", timeout_secs: int = 60) -> str:
return self._inner.register(recording_uri, recording_layer=recording_layer, timeout_secs=timeout_secs)

# Do we really want/need a layer per URI?

that is (for now) the very definition of what a layer is
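
For illustration, registering two layers (URIs are made up; as @jleibs notes below, several registrations can add to the same segment):

catalog.register("s3://bucket/rec/base.rrd", recording_layer="base")
catalog.register("s3://bucket/rec/annotations.rrd", recording_layer="annotations")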

Comment on lines +290 to +291
# I assume this registers everything in a s3 bucket?
# Why isn't this just the same call as `register`?
@abey79 abey79 Nov 24, 2025


The FS analogy would be:

  • register -> single file
  • register_batch -> list of files
  • register_prefix -> entire directory (somehow directories are called prefixes in object stores)

I would submit the idea of merging these two APIs into one (using kwargs shenanigans) to the "pythonic review board"
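
For reference, the FS analogy in code (URIs are made up):

catalog.register("s3://bucket/rec.rrd")                             # single file
catalog.register_batch(["s3://bucket/a.rrd", "s3://bucket/b.rrd"])  # list of files
catalog.register_prefix("s3://bucket/recordings/")                  # entire "directory" (prefix)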

@abey79 abey79 Nov 24, 2025


decision:

  • we want:
    • register 1 segment with a given layer name (optionally)
    • register N segments with a single layer name (optionally) (use multiple calls if you want multiple layer names)
    • register all rrds under a given prefix with a single layer name (optionally)
  • all of these return Tasks
  • "pythonic review board" to decide how far to merge register, register_batch and register_prefix (the 3 of them sound like too much) (@abey79 to make a proposal)
  • register -> register_segment/s for clarity (these register a segment/layer combo, not a segment)


"pythonic review board" to decide how far to merge register, register_batch and register_prefix (the 3 of them sound way too much)

@nikolausWest this feels like the opposite direction of the table API design where you pushed for 3 explicitly separate APIs (append/overwrite/upsert) instead of just having a single API and a write_mode parameter.

We need to fundamentally decide if we want a minimal number of APIs with complex parameters and behaviors, or a larger number of APIs with very specific clear functionality.

I am personally leaning toward the latter. e.g. in the table APIs a separate append_batches would make the docs and implementation of error handling more clear (none of this incompatible subsets of parameters business).


Discussing this with @ntjohnson1: having a single register call is OK because all 3 can be used to achieve the exact same outcome (at the cost of a loop or so), whereas append/overwrite/upsert achieve distinct tasks. So "there should be one-- and preferably only one --obvious way to do it" would apply here.


Came on here to agree with @abey79 's last statement. I don't think this is the same argument as having different append/overwrite/upsert, because these are exactly 1 concept with 3 different input types. Taking different input types and automatically knowing how to handle them is one thing Python should be good at.


Regarding three separate actions vs one action with multiple input types, I was thinking the same as @abey79 and @timsaucer.


I'm disallowing it. A list of layer names is allowed only when uri is a list as well, and the lengths must match (see the sketch below).
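
A minimal sketch of that rule, assuming a merged register signature (names and types are illustrative):

def register(self, uri: str | list[str], *, layer: str | list[str] = "base") -> Tasks:
    if isinstance(layer, list):
        # a list of layer names is only allowed when `uri` is also a list...
        if not isinstance(uri, list):
            raise TypeError("a list of layer names requires a list of URIs")
        # ...and the lengths must match
        if len(layer) != len(uri):
            raise ValueError("`uri` and `layer` must have the same length")
    ...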


I buy the argument for merging register and register_batch. Promoting a single element to a list is fairly common. In both of them the total number of sub-tasks in the output is known and deterministic based on the input.

register_prefix, on the other hand, still feels like a legitimate odd duck to me. It's not something where you can achieve the same thing with a loop. The user may not even have permissions for the s3 location they are trying to register. (For that matter, the API itself also requires broader server permissions -- list vs read.) The number of results is unknown when the user makes the call. Behind the scenes, the server also goes down a fairly distinct code path to scan the prefix first.


The user may not even have permissions for the s3 location they are trying to register

This seems more like a bug than a feature to me, but it does provide a weird edge case around the equivalence of these methods.

One thing I mentioned to Antoine, which might be captured elsewhere: if we make our Tasks/Jobs object nicer/more informative, that might help make it clearer when their prefix is not aligned with their expectations. However, if they just use boto3 and rglob the prefix themselves, the same footgun exists: it's easy to add a lot more data than you thought, so it's pretty hard to design that away even with a different method.


I share the opinion that register and register_batch are very similar, and register_prefix a bit less so. Seeing this being debated, I'm personally undecided about merging them all vs. keeping register + register_prefix.

If I had to make the call, I would go for register/register_prefix because, when in doubt, I pick explicit over full-on kwargs.

Comment on lines +317 to 319
# May be mistakenly read in the imperative ("please index these ranges")
# No better suggestion though.
def index_ranges(self, index: str | IndexColumnDescriptor) -> datafusion.DataFrame:

get_index_ranges might be a better name, yeah


This goes to the above discussion for get_ prefixes and the conflict between brevity and being explicit. I would not expect index_ranges to perform indexing.

@abey79 abey79 Nov 25, 2025


@timsaucer I'm unsure about which direction you are trying to push.

As per my newfound "rule" of "skip get_ when it looks like a @property (but cannot be one)", this should have get_ as it takes a mandatory argument.


renamed to get_index_ranges in:

base_tokenizer=base_tokenizer,
)

# Not part of MVP. "Experimental - talk to sales"

)

# Is this only secondary indices? Should the name reflect that?
def list_indexes(self) -> list:

decision:

  • create_secondary_index (kwargs to define which type)
  • list_secondary_indices
  • delete_secondary_indices


@ntjohnson1 proposal: "search index" (illustrated below)

  • create_search_index
    • or create_fts_search_index and create_vector_search_index
  • list_search_indices
  • delete_search_indices
  • search
    • or fts_search/vector_search (docstring: "requires relevant search index")
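
For illustration, the proposed surface (argument names are made up):

dataset.create_search_index("text_col", kind="fts")  # kwargs select the index type
dataset.list_search_indices()
dataset.delete_search_indices("text_col")
dataset.search("query")                              # docstring: "requires relevant search index"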


Fair and more descriptive. Let's go with "search index".


MVP of that change implemented in:

def search_vector(self, query: Any, column: Any, top_k: int) -> Any:
return self._inner.search_vector(query, column, top_k)

# Should also return Job

yes


...well. The gRPC API currently doesn't return tasks. I'm not sure how this is even done in cloud. I suggest we leave this aside for this project, as this feature is entirely orthogonal to everything else.

│ │ type: Utf8 ┆ type: i32 │ │
│ ╞════════════╪════════════╡ │
│ │ __entries ┆ 3 │ │
│ │ __entries ┆ 3 │ │ # TODO(emilk): Can we remove __entries?

the entries table as a dataframe is essentially disappearing from the API, so it's ok for it to remain low-level

Comment on lines +76 to +77
# Should row_id be part of this?
# An API like `dataset_view.select_row_ids(row_ids)` would be the fastest way to fetch those rows of data.

This hinges on exposing the ability to run dataframe queries with row_id as index, which we currently don't expose (at least on python side). TBD, but probably out of scope for the API project.
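
For illustration only, the shape proposed in the quoted comment (select_row_ids is not an existing API):

row_ids = [...]  # row ids obtained from an earlier query
df = dataset_view.select_row_ids(row_ids)  # hypothetical: fetch exactly those rows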

# datetime.datetime(2000, 1, 1, 0, 0, 1, microsecond=500),
# ],
# fill_latest_at=True,
# ).concat(dataset_view.reader(

This makes sense for merging 1 or 2 views but seems impractical as a replacement for using_index_values on 1000s of segments.

# Create a view with all partitions
# TODO(Niko): Future API idea: "Sample the data at 20 Hz starting at this time point with latest-at"

# This is a cumbersome API. Can we do better?

Do you find the version that directly takes a DataFrame to be cumbersome?


FYI, we discussed that at length with Niko, and concluded that:

  • the current proposal is not great
  • it's not immediately obvious how to make it better
  • a lot of this discussion hinges on future features (time resample helper, alternate interpolation functions, etc.)
  • we are unable to guess now how these future features will look (aka chances are we'd make the wrong choices)

=> we're OK to roll with the proposal in its current state


Regarding the cumbersome comment: as Emil mentioned, this was just some notes we wrote down between flights, not meant as a real full review on its own. On first read-through, the combination of timeline + using_index_values + latest-at didn't seem future-proof, and it felt like there might be a more composable way to do it. One perceived problem was that we're likely to want more kinds of sampling methods, which will force more kwargs. The other is that more fill methods will do the same. However, we weren't able to come up with something composable that was good (timeline, sampling, and fill methods belong together), and we don't have enough clarity on the grounded needs for the sampling and fill methods to propose something better.

That's why we concluded it's best to keep it as is and then evolve this API as needed.


# …
# job.block_and_tqdm()

# Call these `register_segment` and `register_segments` for clarity

The challenge here is it's not a segment. It's a segment-layer (which would benefit from a better name). Several registrations can all add to the same segment.

def blueprint_dataset_id(self) -> EntryId | None:
return self._inner.blueprint_dataset_id()
# Remove this blueprint stuff, and just have
def set_default_blueprint(self, blueprint: Blueprint | Path) -> None:

For the API, can we just say set_blueprint? Because we either have the default blueprint or the heuristic in the viewer, so this is the only configurable blueprint.


Comment on lines +73 to +74
# Why are _some_ functions prefixed with `get_` and not others?
# Is the logic "if it takes an argument"? Maybe this is Pythonic, IDK

I think in general Python prefers brevity and Rust does not. So we have a case here where I suspect we are mirroring the internal APIs. I think we should remove the get_ prefixes where possible.


The current plan is:

def datasets() -> list[DatasetEntry]: ...
def get_dataset(*, id, name) -> DatasetEntry: ...

This morning, I got a bit annoyed with the inconsistency. Why is there a get_ prefix in one and not in the other? Yet, I felt that get_datasets() is unnecessarily verbose, and dataset(id, name) is not really pythonic, leading to a bit of cognitive dissonance.

I poked around a bit and ended up making peace with the status quo. Rationale:

  • datasets() doesn't need an argument (it actually has one optional arg now to show the hidden entries, but that's niche), so it looks and feels like a property. It cannot be a property given that network calls are involved, but it is one in spirit.
  • get_dataset() requires an argument, so it is unambiguously an accessor, for which the get_ prefix is extremely common in Python. The core inspiration here being dict: my_dict.get("item")
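
The convention applied (the dataset name is made up):

all_datasets = catalog.datasets()            # property-like: no required argument, no get_ prefix
ds = catalog.get_dataset(name="my_dataset")  # accessor with a required argument, hence get_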




id: fixed_size_binary[16] not null
name: string not null
entry_kind: int32 not null
entry_kind: int32 not null # Why is this int32 and not an enum? Poor py-arrow support?

The underlying data type is int32. We can create an extension type and then also update our rendering system to use those extension types.
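
A minimal sketch of such an extension type in pyarrow (the extension name is made up):

import pyarrow as pa

class EntryKindType(pa.ExtensionType):
    # hypothetical extension type wrapping the int32 entry_kind column
    def __init__(self):
        super().__init__(pa.int32(), "rerun.catalog.entry_kind")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(EntryKindType())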


I agree, but imo out of scope for this project.

Comment on lines +26 to +27
│ METADATA: # TODO(emilk): can we skip this?
│ * version: 0.1.1 # TODO(emilk): can we skip this? Or call it `rerun_schema_version`?

Yes, we can update the rendering. Easily done. The question is: what about other metadata if the user adds it?


The main problem here is that version is unclear. What version is it referring to? Can the user add their own metadata here?

@abey79 abey79 Nov 25, 2025


From the code, it should be sorbet:version. I have no idea why we only see version here. Something about the formatting?

edit: confirmed, it's indeed sorbet:version in the data. I'm not sure why this gets printed as version for str(datafusion.DataFrame). Maybe a bug @timsaucer? Related to the handling of extension types? (These involve : as well iirc.)


Added notes to the issue in the datafusion-python repo, but here is the offending source:

let batch_opts = RecordBatchFormatOpts::default();
The default options are to trim metadata, which is applied both at the header and for the columns.


Ah, right! I've been completely fooled by this machinery I wasn't aware of. Sorry for the noise.
