-
Notifications
You must be signed in to change notification settings - Fork 17
feat: narwhals-compliant dataframe setup #112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is yet another attempt to bring dataframe api to substrait-python, also trying to bring the best of both worlds from previous approaches. PR adds DataFrame class that is compliant to narwhals LazyFrame API. narwhals is a lightweight datarame api layer that specifies the api and handles function routing. It's basically odbc (or should i say adbc) for dataframes, just the protocol and nothing else. That makes building a narwhals-compliant dataframe a lot simpler. Another upside of this approach is that we technically don't have to depend on narwhals at all, the api can be used either natively ( From implementation perspective, dataframe should also be a very lightweight glue code that delegates all the heavy lifting to plan and extended expression builders from |
|
thanks for your interest! we're revisiting the extensions/plugins mechanism, and my hope is that in a few months it will be significantly easier to build and test these - if it feels hard and confusing at the moment, there is light at the end of the tunnel! |
vbarua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR adds DataFrame class that is compliant to narwhals LazyFrame API. narwhals is a lightweight datarame api layer that specifies the api and handles function routing. It's basically odbc (or should i say adbc) for dataframes, just the protocol and nothing else. That makes building a narwhals-compliant dataframe a lot simpler.
Another upside of this approach is that we technically don't have to depend on narwhals at all, the api can be used either natively (import substrait.dataframe as sdf as namespace) or with narwhals (import narwhals as nw).
From implementation perspective, dataframe should also be a very lightweight glue code that delegates all the heavy lifting to plan and extended expression builders from substrait.builders.
This all seems great, and I think starting minimal and expanding support makes a ton of sense. Generally happy to merge as is, but if you could include a bit more context on the TODOs you've left I would appreciate it. I suspect you'll be the one to handling them anyways, but it's good not to assume.
| expressions = [e.expr for e in exprs] + [ | ||
| expr.alias(alias).expr for alias, expr in named_exprs.items() | ||
| ] | ||
| return DataFrame(project(self.plan, expressions=expressions)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure to update this to use the new select builder once your work in #125 is merged in.
| return self | ||
|
|
||
| def abs(self): | ||
| self.expr = scalar_function("functions_arithmetic.yaml", "abs", expressions=[self.expr]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this PR will get merged before your work in #107. We'll need to make sure to update this.
| ] | ||
| return DataFrame(project(self.plan, expressions=expressions)) | ||
|
|
||
| # TODO handle version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's needs to exist to handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no clue at this point, I need to dig deeper to understand if we can disregard it or not. that's what I meant by "handle version".
just to follow up on this, we've done this, and https://github.com/narwhals-dev/narwhals-daft should hopefully now serve as a useful blueprint! |
Adds minimal setup for narwhals-complaint dataframe (lazyframe).