-
Notifications
You must be signed in to change notification settings - Fork 18
feat: Add Polars DataFrame support #71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThis pull request enhances the pyOpenMS-Viz library by adding support for both pandas and polars DataFrames. The documentation is updated with a new "Quick Start" section and example code, while a new Changes
Sequence Diagram(s)sequenceDiagram
participant U as User
participant P as pyopenms_viz.plot
participant DF as DataFrameWrapper
participant PA as PlotAccessor
U->>P: Call plot(data, *args, **kwargs)
P->>DF: wrap_dataframe(data)
DF-->>P: Return wrapped DataFrame
P->>PA: Initialize PlotAccessor with wrapped data and parameters
PA-->>P: Provide plotting interface
P-->>U: Return interactive plot
Possibly related issues
Poem
Tip ⚡🧪 Multi-step agentic review comment chat (experimental)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (13)
requirements.txt (1)
220-221
: Consider maintaining alphabetical order in requirements.txtFor better readability and maintainability, consider placing the polars dependency in alphabetical order or regenerating the file using pip-compile to maintain consistency.
The polars dependency could be placed between lines 126 and 127, following alphabetical ordering with the other packages.
README.md (1)
86-89
: Consider a more professional phrasing for contributionsThe language "feel free to" is commonly used but could be replaced with a more professional alternative.
-Contributions are welcome! Please feel free to submit a Pull Request. +Contributions are welcome! We encourage you to submit a Pull Request.🧰 Tools
🪛 LanguageTool
[style] ~88-~88: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ...ontributing Contributions are welcome! Please feel free to submit a Pull Request. ## License Thi...(FEEL_FREE_TO_STYLE_ME)
pyopenms_viz/_dataframe.py (6)
3-3
: Remove unused Optional type import.The
Optional
type is imported but never used in this file.-from typing import Any, Union, List, Optional +from typing import Any, Union, List🧰 Tools
🪛 Ruff (0.8.2)
3-3:
typing.Optional
imported but unusedRemove unused import:
typing.Optional
(F401)
43-48
: Simplify the get_column method implementation.Both branches of the conditional perform the same operation (converting the column to a numpy array). This can be simplified.
def get_column(self, col: str) -> np.ndarray: """Get a column as numpy array.""" - if self._is_polars: - return self._data[col].to_numpy() - return self._data[col].to_numpy() + return self._data[col].to_numpy()
75-80
: Simplify the max method implementation.Both branches of the conditional perform the same operation. This can be simplified.
def max(self, col: str) -> float: """Get maximum value of a column.""" - if self._is_polars: - return self._data[col].max() - return self._data[col].max() + return self._data[col].max()
81-86
: Simplify the min method implementation.Both branches of the conditional perform the same operation. This can be simplified.
def min(self, col: str) -> float: """Get minimum value of a column.""" - if self._is_polars: - return self._data[col].min() - return self._data[col].min() + return self._data[col].min()
93-93
: Rename unused loop variable.The loop variable
idx
is not used within the loop body. By convention, unused variables should be prefixed with an underscore.- for idx, row in self._data.iterrows(): + for _idx, row in self._data.iterrows():🧰 Tools
🪛 Ruff (0.8.2)
93-93: Loop control variable
idx
not used within loop bodyRename unused
idx
to_idx
(B007)
1-141
: Consider adding more comprehensive DataFrame operations.The current implementation covers basic DataFrame operations, but you might need to add more methods as the library evolves. Consider implementing additional common operations like:
- Filtering (e.g.,
df[df['column'] > value]
)- Joining/merging DataFrames
- Selection of multiple columns
- DataFrame shape property
- Row/column addition functions
Would you like me to suggest implementations for any of these additional methods?
🧰 Tools
🪛 Ruff (0.8.2)
3-3:
typing.Optional
imported but unusedRemove unused import:
typing.Optional
(F401)
93-93: Loop control variable
idx
not used within loop bodyRename unused
idx
to_idx
(B007)
132-132: Function definition does not bind loop variable
agg_func
(B023)
pyopenms_viz/_bokeh/core.py (1)
5-5
: Remove or utilize unused imports.Static analysis indicates that
Iterator
,Any
,Dict
,List
,Optional
,Union
,np
, andDataFrameWrapper
are imported but never used. Consider removing them to keep the code clean.-from typing import Tuple, Iterator, Any, Dict, List, Optional, Union +from typing import Tuple ... -import numpy as np ... -from .._dataframe import DataFrameWrapperAlso applies to: 24-24, 39-39
🧰 Tools
🪛 Ruff (0.8.2)
5-5:
typing.Iterator
imported but unusedRemove unused import
(F401)
5-5:
typing.Any
imported but unusedRemove unused import
(F401)
5-5:
typing.Dict
imported but unusedRemove unused import
(F401)
5-5:
typing.List
imported but unusedRemove unused import
(F401)
5-5:
typing.Optional
imported but unusedRemove unused import
(F401)
5-5:
typing.Union
imported but unusedRemove unused import
(F401)
pyopenms_viz/_matplotlib/core.py (1)
4-4
: Avoid importing unused symbols.Imports for
Any
,Dict
,List
,Optional
,Union
,re
,Line2D
,Figure
,np
,DataFrame
,Axes3D
, andDataFrameWrapper
seem unused per static analysis. Remove them if not needed.-from typing import Tuple, Any, Dict, List, Optional, Union +from typing import Tuple ... -import re ... -from matplotlib.lines import Line2D -from matplotlib.figure import Figure ... -import numpy as np -from pandas import DataFrame ... -from .._dataframe import DataFrameWrapperAlso applies to: 12-12, 13-13, 31-31
🧰 Tools
🪛 Ruff (0.8.2)
4-4:
typing.Any
imported but unusedRemove unused import
(F401)
4-4:
typing.Dict
imported but unusedRemove unused import
(F401)
4-4:
typing.List
imported but unusedRemove unused import
(F401)
4-4:
typing.Optional
imported but unusedRemove unused import
(F401)
4-4:
typing.Union
imported but unusedRemove unused import
(F401)
pyopenms_viz/_core.py (3)
34-35
: Remove or utilize these unused imports.
MarkerShapeGenerator
,is_latex_formatted
,PEAK_BOUNDARY_ICON
,FEATURE_BOUNDARY_ICON
,DataFrameWrapper
are flagged as unused.-from ._misc import ( - MarkerShapeGenerator, - is_latex_formatted, -) ... -from .constants import IS_SPHINX_BUILD, IS_NOTEBOOK, PEAK_BOUNDARY_ICON, FEATURE_BOUNDARY_ICON ... -from ._dataframe import DataFrameType, DataFrameWrapper, wrap_dataframe +from ._dataframe import DataFrameType, wrap_dataframeAlso applies to: 37-37, 39-39
🧰 Tools
🪛 Ruff (0.8.2)
34-34:
._misc.MarkerShapeGenerator
imported but unusedRemove unused import
(F401)
35-35:
._misc.is_latex_formatted
imported but unusedRemove unused import
(F401)
1003-1003
: Simplify conditional expression.Use a direct Boolean expression:
-ascending=True if data.get_column(y).min() < 0 else False +ascending=data.get_column(y).min() < 0🧰 Tools
🪛 Ruff (0.8.2)
1003-1003: Remove unnecessary
True if ... else False
Remove unnecessary
True if ... else False
(SIM210)
1133-1133
: Avoid comparing boolean with True.As recommended by your static analysis, remove explicit equality:
-if self.bin_peaks == True or ( +if self.bin_peaks or (🧰 Tools
🪛 Ruff (0.8.2)
1133-1133: Avoid equality comparisons to
True
; useif self.bin_peaks:
for truth checksReplace with
self.bin_peaks
(E712)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
README.md
(2 hunks)pyopenms_viz/__init__.py
(2 hunks)pyopenms_viz/_bokeh/core.py
(5 hunks)pyopenms_viz/_core.py
(23 hunks)pyopenms_viz/_dataframe.py
(1 hunks)pyopenms_viz/_matplotlib/core.py
(3 hunks)pyopenms_viz/_plotly/core.py
(2 hunks)requirements.txt
(1 hunks)
🧰 Additional context used
🧬 Code Definitions (3)
pyopenms_viz/_bokeh/core.py (1)
pyopenms_viz/_dataframe.py (3) (3)
DataFrameWrapper
(10:105)data
(21:23)to_pandas
(25:29)
pyopenms_viz/_matplotlib/core.py (1)
pyopenms_viz/_dataframe.py (3) (3)
DataFrameWrapper
(10:105)data
(21:23)groupby
(56:60)
pyopenms_viz/_core.py (1)
pyopenms_viz/_dataframe.py (9) (9)
DataFrameWrapper
(10:105)wrap_dataframe
(139:141)data
(21:23)set_column
(49:54)get_column
(43:47)groupby
(56:60)max
(75:79)min
(81:85)columns
(97:101)
🪛 LanguageTool
README.md
[style] ~88-~88: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ...ontributing Contributions are welcome! Please feel free to submit a Pull Request. ## License Thi...
(FEEL_FREE_TO_STYLE_ME)
🪛 Ruff (0.8.2)
pyopenms_viz/_plotly/core.py
5-5: typing.List
imported but unused
Remove unused import
(F401)
5-5: typing.Union
imported but unused
Remove unused import
(F401)
5-5: typing.Any
imported but unused
Remove unused import
(F401)
5-5: typing.Dict
imported but unused
Remove unused import
(F401)
5-5: typing.Optional
imported but unused
Remove unused import
(F401)
31-31: .._dataframe.DataFrameWrapper
imported but unused
Remove unused import: .._dataframe.DataFrameWrapper
(F401)
pyopenms_viz/_dataframe.py
3-3: typing.Optional
imported but unused
Remove unused import: typing.Optional
(F401)
93-93: Loop control variable idx
not used within loop body
Rename unused idx
to _idx
(B007)
132-132: Function definition does not bind loop variable agg_func
(B023)
pyopenms_viz/_bokeh/core.py
5-5: typing.Iterator
imported but unused
Remove unused import
(F401)
5-5: typing.Any
imported but unused
Remove unused import
(F401)
5-5: typing.Dict
imported but unused
Remove unused import
(F401)
5-5: typing.List
imported but unused
Remove unused import
(F401)
5-5: typing.Optional
imported but unused
Remove unused import
(F401)
5-5: typing.Union
imported but unused
Remove unused import
(F401)
24-24: numpy
imported but unused
Remove unused import: numpy
(F401)
39-39: .._dataframe.DataFrameWrapper
imported but unused
Remove unused import: .._dataframe.DataFrameWrapper
(F401)
pyopenms_viz/_matplotlib/core.py
4-4: typing.Any
imported but unused
Remove unused import
(F401)
4-4: typing.Dict
imported but unused
Remove unused import
(F401)
4-4: typing.List
imported but unused
Remove unused import
(F401)
4-4: typing.Optional
imported but unused
Remove unused import
(F401)
4-4: typing.Union
imported but unused
Remove unused import
(F401)
5-5: re
imported but unused
Remove unused import: re
(F401)
8-8: matplotlib.lines.Line2D
imported but unused
Remove unused import: matplotlib.lines.Line2D
(F401)
11-11: matplotlib.figure.Figure
imported but unused
Remove unused import: matplotlib.figure.Figure
(F401)
12-12: numpy
imported but unused
Remove unused import: numpy
(F401)
13-13: pandas.DataFrame
imported but unused
Remove unused import: pandas.DataFrame
(F401)
14-14: mpl_toolkits.mplot3d.Axes3D
imported but unused
Remove unused import: mpl_toolkits.mplot3d.Axes3D
(F401)
31-31: .._dataframe.DataFrameWrapper
imported but unused
Remove unused import: .._dataframe.DataFrameWrapper
(F401)
pyopenms_viz/_core.py
34-34: ._misc.MarkerShapeGenerator
imported but unused
Remove unused import
(F401)
35-35: ._misc.is_latex_formatted
imported but unused
Remove unused import
(F401)
37-37: .constants.PEAK_BOUNDARY_ICON
imported but unused
Remove unused import
(F401)
37-37: .constants.FEATURE_BOUNDARY_ICON
imported but unused
Remove unused import
(F401)
39-39: ._dataframe.DataFrameWrapper
imported but unused
Remove unused import: ._dataframe.DataFrameWrapper
(F401)
1003-1003: Remove unnecessary True if ... else False
Remove unnecessary True if ... else False
(SIM210)
1133-1133: Avoid equality comparisons to True
; use if self.bin_peaks:
for truth checks
Replace with self.bin_peaks
(E712)
🔇 Additional comments (18)
requirements.txt (1)
220-220
:✅ Verification successful
Polars dependency added to meet feature requirements.
The addition of
polars>=0.20.7
is appropriate for enabling the new Polars DataFrame support functionality.Let's verify the status of the specified polars version:
🏁 Script executed:
#!/bin/bash # Check current polars version on PyPI and latest release date LATEST=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.info.version') RELEASE_DATE=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.releases["'$LATEST'"][0].upload_time') echo "Requested version: >=0.20.7" echo "Latest version on PyPI: $LATEST" echo "Latest release date: $RELEASE_DATE" # Check if 0.20.7 is a valid version VERSION_EXISTS=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.releases | has("0.20.7")') echo "Version 0.20.7 exists: $VERSION_EXISTS" # Check minimum Python version required MIN_PYTHON=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.info.requires_python') echo "Minimum Python required: $MIN_PYTHON"Length of output: 773
Polars Dependency Verification Passed
The verification confirms that the dependency
polars>=0.20.7
is valid:
- Version 0.20.7 exists on PyPI.
- The latest available version is 1.25.2.
- The minimum required Python version is >=3.9.
The addition of this dependency is appropriate for enabling the new Polars DataFrame support functionality.
README.md (2)
23-25
: Clear feature announcements enhance user understanding.These new bullet points effectively communicate the key enhancements of supporting both pandas and polars DataFrames, along with interactive plot capabilities and customization options.
64-85
: Excellent Quick Start examples for both pandas and polars.The added Quick Start section provides clear, concise examples demonstrating how to use both pandas and polars DataFrames with the new plot function. This will help users quickly understand how to leverage the new functionality.
pyopenms_viz/_dataframe.py (3)
8-8
: Good type alias for DataFrame support.Creating a
DataFrameType
type alias for the union of pandas and polars DataFrames enhances code readability and maintainability.
10-42
: Well-structured DataFrameWrapper class with clear method implementations.The DataFrameWrapper class provides a unified interface for handling both pandas and polars DataFrames, with clear methods for conversion, copying, and property access.
139-141
: Well-defined utility function for wrapping DataFrames.The
wrap_dataframe
function provides a clean interface for creating aDataFrameWrapper
instance from either a pandas or polars DataFrame.pyopenms_viz/__init__.py (3)
5-9
: Good job introducing Polars imports.These new imports for
typing
,pandas
, andpolars
set the stage for dual DataFrame support. Just ensure your documentation clarifies that Polars is now an additional dependency.
11-11
: Confirm usage ofPlotAccessor
.You’ve added an import for
PlotAccessor
from pandas plotting. Verify that you truly intend to rely on Pandas’ internalPlotAccessor
. If this is only for type references or if you rely on your custom logic, consider removing it to avoid confusion.
205-224
: Introducing a unifiedplot
function looks great.This function clearly allows passing either a Pandas or Polars DataFrame. The interface is flexible, and the docstring is well-documented. Consider adding a compatibility check or fallback in case Polars is not installed.
pyopenms_viz/_bokeh/core.py (2)
299-299
: Usingto_pandas()
on line data.Converting your wrapper object to a Pandas DataFrame is valid for Bokeh's
ColumnDataSource
. Ensure that any polars DataFrame is also properly converted to Pandas at this step, which it is, thanks to the wrapper’sto_pandas
method.
310-310
: Likewise converting grouped data withto_pandas()
.This usage is consistent for each grouped subset. Looks good.
pyopenms_viz/_matplotlib/core.py (1)
314-314
: Group-by ordering logic.The removal of
sort=True
is fine to preserve the data’s natural order. If stable grouping is required, document that or reintroduce an explicit sorting step.pyopenms_viz/_core.py (6)
133-133
: Constructor now wraps any DataFrame.Using
wrap_dataframe()
is a clean approach for unified Pandas/Polars handling. Good move.
151-151
: Casting grouping column to string.Converting
self.by
column data to string helps avoid issues with non-string grouping keys. Looks sensible.
183-183
: Integer column index check.Well done verifying the user’s integer-based column references. This prevents out-of-range or numeric type confusion.
235-238
: Aggregation logic for duplicate data.This handling is aligned with your
aggregate_duplicates
flag. The approach is clear and warns the user if duplicates remain.
299-299
: Duplicate data check in reference spectrum.Ensuring your reference spectrum also respects
aggregate_duplicates
is a consistent approach.
304-308
: Aggregation for reference spectrum.Grouping by known columns, summing, and resetting the index for your reference data is consistent with the main data’s logic.
@@ -2,7 +2,7 @@ | |||
|
|||
from abc import ABC | |||
|
|||
from typing import List, Tuple, Union | |||
from typing import List, Tuple, Union, Any, Dict, Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imported type hints are not used in this file.
The added type hints Any
, Dict
, and Optional
are imported but not used in this file, which could potentially cause linting issues.
-from typing import List, Tuple, Union, Any, Dict, Optional
+from typing import List, Tuple, Union
Only add these imports if they'll be used in upcoming changes to this file.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from typing import List, Tuple, Union, Any, Dict, Optional | |
from typing import List, Tuple, Union |
🧰 Tools
🪛 Ruff (0.8.2)
5-5: typing.List
imported but unused
Remove unused import
(F401)
5-5: typing.Union
imported but unused
Remove unused import
(F401)
5-5: typing.Any
imported but unused
Remove unused import
(F401)
5-5: typing.Dict
imported but unused
Remove unused import
(F401)
5-5: typing.Optional
imported but unused
Remove unused import
(F401)
@@ -28,6 +28,7 @@ | |||
from .._config import bokeh_line_dash_mapper | |||
from .._misc import ColorGenerator, MarkerShapeGenerator, is_latex_formatted | |||
from ..constants import PEAK_BOUNDARY_ICON, FEATURE_BOUNDARY_ICON | |||
from .._dataframe import DataFrameWrapper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imported DataFrameWrapper is not used in this file.
The DataFrameWrapper
class is imported but not used in this file, which could cause linting issues.
-from .._dataframe import DataFrameWrapper
Only add this import if it will be used in this file. If the intent is to prepare for future changes where the DataFrame handling will be updated to use the wrapper, consider adding a comment explaining this.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from .._dataframe import DataFrameWrapper |
🧰 Tools
🪛 Ruff (0.8.2)
31-31: .._dataframe.DataFrameWrapper
imported but unused
Remove unused import: .._dataframe.DataFrameWrapper
(F401)
def agg(self, func: dict) -> DataFrameWrapper: | ||
"""Aggregate using the specified functions.""" | ||
if self._is_polars: | ||
agg_exprs = [] | ||
for col, agg_func in func.items(): | ||
if isinstance(agg_func, str): | ||
agg_exprs.append(pl.col(col).agg(agg_func)) | ||
else: | ||
agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy()))) | ||
result = self._grouped.agg(agg_exprs) | ||
return DataFrameWrapper(result) | ||
else: | ||
return DataFrameWrapper(self._grouped.agg(func)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix potential closure issue in agg method.
The agg_func
variable is used inside a lambda function within a loop, which could lead to unexpected behavior due to late binding.
def agg(self, func: dict) -> DataFrameWrapper:
"""Aggregate using the specified functions."""
if self._is_polars:
agg_exprs = []
for col, agg_func in func.items():
if isinstance(agg_func, str):
agg_exprs.append(pl.col(col).agg(agg_func))
else:
- agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy())))
+ # Create a closure to capture the current value of agg_func
+ agg_exprs.append(pl.col(col).agg(lambda x, func=agg_func: func(x.to_numpy())))
result = self._grouped.agg(agg_exprs)
return DataFrameWrapper(result)
else:
return DataFrameWrapper(self._grouped.agg(func))
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def agg(self, func: dict) -> DataFrameWrapper: | |
"""Aggregate using the specified functions.""" | |
if self._is_polars: | |
agg_exprs = [] | |
for col, agg_func in func.items(): | |
if isinstance(agg_func, str): | |
agg_exprs.append(pl.col(col).agg(agg_func)) | |
else: | |
agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy()))) | |
result = self._grouped.agg(agg_exprs) | |
return DataFrameWrapper(result) | |
else: | |
return DataFrameWrapper(self._grouped.agg(func)) | |
def agg(self, func: dict) -> DataFrameWrapper: | |
"""Aggregate using the specified functions.""" | |
if self._is_polars: | |
agg_exprs = [] | |
for col, agg_func in func.items(): | |
if isinstance(agg_func, str): | |
agg_exprs.append(pl.col(col).agg(agg_func)) | |
else: | |
# Create a closure to capture the current value of agg_func | |
agg_exprs.append(pl.col(col).agg(lambda x, func=agg_func: func(x.to_numpy()))) | |
result = self._grouped.agg(agg_exprs) | |
return DataFrameWrapper(result) | |
else: | |
return DataFrameWrapper(self._grouped.agg(func)) |
🧰 Tools
🪛 Ruff (0.8.2)
132-132: Function definition does not bind loop variable agg_func
(B023)
def iterrows(self): | ||
"""Iterate over DataFrame rows.""" | ||
if self._is_polars: | ||
for row in self._data.iter_rows(named=True): | ||
yield row | ||
else: | ||
for idx, row in self._data.iterrows(): | ||
yield row | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix inconsistency in iterrows behavior.
The pandas iterrows()
method returns a tuple of (index, row), but the implementation here returns only the row for both pandas and polars DataFrames. This inconsistency could lead to unexpected behavior.
def iterrows(self):
"""Iterate over DataFrame rows."""
if self._is_polars:
for row in self._data.iter_rows(named=True):
- yield row
+ yield None, row # Return (None, row) to match pandas pattern
else:
for idx, row in self._data.iterrows():
- yield row
+ yield idx, row # Return (idx, row) as pandas does
Alternatively, if you want to standardize on returning just the row, update the docstring to clarify this departure from the pandas behavior.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def iterrows(self): | |
"""Iterate over DataFrame rows.""" | |
if self._is_polars: | |
for row in self._data.iter_rows(named=True): | |
yield row | |
else: | |
for idx, row in self._data.iterrows(): | |
yield row | |
def iterrows(self): | |
"""Iterate over DataFrame rows.""" | |
if self._is_polars: | |
for row in self._data.iter_rows(named=True): | |
yield None, row # Return (None, row) to match pandas pattern | |
else: | |
for idx, row in self._data.iterrows(): | |
yield idx, row # Return (idx, row) as pandas does |
🧰 Tools
🪛 Ruff (0.8.2)
93-93: Loop control variable idx
not used within loop body
Rename unused idx
to _idx
(B007)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good first step. This seems to implement a similar approach as PR 49. However, I'm still not sure if it's the best approach.
- Now we will require two dependencies for dataframes, pandas and polars. So a user will have to install both of these, but maybe they only want one or the other?
- Under the hood, for certain operations (i.e. bokeh source data), the data gets explicitly converted to a pandas df.
- Because there is the
to_pandas
method, I'm not sure if we need the other methods in_dataframe.py
, we could just directly convert the polars df to pandas df in the initplot method
, then we wouldn't need to worry about the downstream stuff using pandas manipulations? - We still would like the option to perform method chaining. For your example in the README, it seems like we need to call
plot(df, ...)
.
My only real concern/issue is the need for having to install two dependencies for pandas and polars. Do you think it would be possible to implement it in a way where polars + pyopenms_viz would not require pandas at all? Could we refactor/re-write some of the underlying plot methods that use pandas manipulations to make them agnostic to the type of dataframe?
@jcharkow, what are you thoughts?
df = pd.DataFrame({ | ||
'mz': [100, 200, 300], | ||
'intensity': [1000, 2000, 3000] | ||
}) | ||
plot(df, x='mz', y='intensity', kind='spectrum') | ||
|
||
# Using polars DataFrame | ||
df_pl = pl.DataFrame({ | ||
'mz': [100, 200, 300], | ||
'intensity': [1000, 2000, 3000] | ||
}) | ||
plot(df_pl, x='mz', y='intensity', kind='spectrum') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the user have to explicitly call the plot method? The nice aspect of pyopenms_viz is that it was designed to be compatible with pandas plot method, so that you can perform object/method chaining df.plot(...)
. This example doesn't really show that. Can the same be done for the polars df example?
|
||
DataFrameType = Union[pd.DataFrame, pl.DataFrame] | ||
|
||
class DataFrameWrapper: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the initial PR 49 suggestion:
One minor thing: maybe consider using names that match to common design pattern terminology.
Using DataFrameInterface and PandasDataFrameAdapter instead of e.g. wrapper
def get_column(self, col: str) -> np.ndarray: | ||
"""Get a column as numpy array.""" | ||
if self._is_polars: | ||
return self._data[col].to_numpy() | ||
return self._data[col].to_numpy() | ||
|
||
def set_column(self, col: str, value: Any) -> None: | ||
"""Set a column value.""" | ||
if self._is_polars: | ||
self._data = self._data.with_columns(pl.Series(col, value)) | ||
else: | ||
self._data[col] = value | ||
|
||
def groupby(self, by: str) -> 'GroupByWrapper': | ||
"""Group the DataFrame by a column.""" | ||
if self._is_polars: | ||
return GroupByWrapper(self._data.groupby(by), is_polars=True) | ||
return GroupByWrapper(self._data.groupby(by), is_polars=False) | ||
|
||
def fillna(self, value: Any) -> 'DataFrameWrapper': | ||
"""Fill NA/null values.""" | ||
if self._is_polars: | ||
return DataFrameWrapper(self._data.fill_null(value)) | ||
return DataFrameWrapper(self._data.fillna(value)) | ||
|
||
def between(self, col: str, left: float, right: float) -> 'DataFrameWrapper': | ||
"""Select rows where column values are between left and right.""" | ||
if self._is_polars: | ||
mask = (self._data[col] >= left) & (self._data[col] <= right) | ||
return DataFrameWrapper(self._data.filter(mask)) | ||
return DataFrameWrapper(self._data[self._data[col].between(left, right)]) | ||
|
||
def max(self, col: str) -> float: | ||
"""Get maximum value of a column.""" | ||
if self._is_polars: | ||
return self._data[col].max() | ||
return self._data[col].max() | ||
|
||
def min(self, col: str) -> float: | ||
"""Get minimum value of a column.""" | ||
if self._is_polars: | ||
return self._data[col].min() | ||
return self._data[col].min() | ||
|
||
def iterrows(self): | ||
"""Iterate over DataFrame rows.""" | ||
if self._is_polars: | ||
for row in self._data.iter_rows(named=True): | ||
yield row | ||
else: | ||
for idx, row in self._data.iterrows(): | ||
yield row | ||
|
||
@property | ||
def columns(self) -> List[str]: | ||
"""Get column names.""" | ||
if self._is_polars: | ||
return self._data.columns | ||
return list(self._data.columns) | ||
|
||
def __getitem__(self, key: str) -> np.ndarray: | ||
"""Get a column by name.""" | ||
return self.get_column(key) | ||
|
||
|
||
class GroupByWrapper: | ||
"""Wrapper for grouped DataFrame operations.""" | ||
|
||
def __init__(self, grouped, is_polars: bool): | ||
self._grouped = grouped | ||
self._is_polars = is_polars | ||
|
||
def __iter__(self): | ||
"""Iterate over groups.""" | ||
if self._is_polars: | ||
for name, group in self._grouped.groups(): | ||
yield name, DataFrameWrapper(group) | ||
else: | ||
for name, group in self._grouped: | ||
yield name, DataFrameWrapper(group) | ||
|
||
def agg(self, func: dict) -> DataFrameWrapper: | ||
"""Aggregate using the specified functions.""" | ||
if self._is_polars: | ||
agg_exprs = [] | ||
for col, agg_func in func.items(): | ||
if isinstance(agg_func, str): | ||
agg_exprs.append(pl.col(col).agg(agg_func)) | ||
else: | ||
agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy()))) | ||
result = self._grouped.agg(agg_exprs) | ||
return DataFrameWrapper(result) | ||
else: | ||
return DataFrameWrapper(self._grouped.agg(func)) | ||
|
||
|
||
def wrap_dataframe(data: DataFrameType) -> DataFrameWrapper: | ||
"""Create a DataFrameWrapper instance from a pandas or polars DataFrame.""" | ||
return DataFrameWrapper(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there is a to_pandas
method, I am wondering if these other methods are needed? Would it just be better to just convert the data to pandas upfront in the plotting init method, then nothing else downstream would require changing?
Pull Request description
feat: Add Polars DataFrame support
Which issue this PR aims to resolve or fix?
Solve #55
How to test these changes
Pull Request checklists
This PR is a:
About this PR:
Author's checklist:
complexity.
Additional information
Reviewer's checklist
Copy and paste this template for your review's note:
Summary by CodeRabbit
New Features
Documentation