feat: Add Polars DataFrame support #71

SanjayUG · 2025-03-18T12:11:52Z

Pull Request description

feat: Add Polars DataFrame support

Create DataFrame abstraction layer for pandas/polars compatibility
Update plotting backends to handle both DataFrame types
Add polars dependency and documentation
Maintain existing API and functionality

Which issue this PR aims to resolve or fix?

Solve #55

How to test these changes

Ensure you have both pandas and polars installed
Run existing tests to ensure compatibility with both DataFrame types
Verify that plotting functions work with both pandas and polars DataFrames

Pull Request checklists

This PR is a:

bug-fix
new feature
maintenance

About this PR:

it includes tests.
the tests are executed on CI.
the tests generate log file(s) (path).
pre-commit hooks were executed locally.
this PR requires a project documentation update.

Author's checklist:

I have reviewed the changes and it contains no misspelling.
The code is well commented, especially in the parts that contain more
complexity.
New and old tests passed locally.

Additional information

Reviewer's checklist

Copy and paste this template for your review's note:

## Reviewer's Checklist

- [ ] I managed to reproduce the problem locally from the `main` branch
- [ ] I managed to test the new changes locally
- [ ] I confirm that the issues mentioned were fixed/resolved.

Summary by CodeRabbit

New Features
- Expanded support for both pandas and polars DataFrames, offering greater flexibility for data visualization.
- Enhanced interactive plotting with intuitive zoom, pan, and hover functionalities, plus improved customizable styling and annotations.
Documentation
- Introduced a new "Quick Start" guide featuring example code to streamline the onboarding process for new users.

coderabbitai · 2025-03-18T12:12:01Z

Walkthrough

This pull request enhances the pyOpenMS-Viz library by adding support for both pandas and polars DataFrames. The documentation is updated with a new "Quick Start" section and example code, while a new plot function is introduced to accept either DataFrame type. Internal modifications improve data handling by refactoring column accesses through dedicated methods and wrapping DataFrames using a unified interface. Additionally, the changes update type hints and imports in various visualization backends and add a new dependency on the polars library.

Changes

File(s)	Change Summary
README.md	- Added a new "Quick Start" section with code examples showcasing pandas and polars usage. - Updated documentation to highlight interactive plot features and customizable styling.
pyopenms_viz/init.py	- Introduced a new `plot` function supporting both pandas and polars DataFrames. - Enhanced the function with detailed docstrings and flexible argument handling through PlotAccessor.
pyopenms_viz/_bokeh/core.py, pyopenms_viz/_matplotlib/core.py, pyopenms_viz/_plotly/core.py	- Updated import statements to include extended type hints and new libraries (e.g., numpy, DataFrameWrapper). - In Bokeh, data is now converted with `.to_pandas()` for ColumnDataSource compatibility; Matplotlib’s groupby call has been adjusted.
pyopenms_viz/_core.py	- Modified the BasePlot class to accept a union of DataFrame types using `wrap_dataframe`. - Replaced direct column access with `get_column` and `set_column` methods. - Added new imports for enhanced visualization handling.
pyopenms_viz/_dataframe.py	- Added a new file providing the `DataFrameWrapper` and `GroupByWrapper` classes, along with the `wrap_dataframe` function, offering a unified interface for both pandas and polars DataFrames.
requirements.txt	- Added a dependency: `polars>=0.20.7`.

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant P as pyopenms_viz.plot
    participant DF as DataFrameWrapper
    participant PA as PlotAccessor

    U->>P: Call plot(data, *args, **kwargs)
    P->>DF: wrap_dataframe(data)
    DF-->>P: Return wrapped DataFrame
    P->>PA: Initialize PlotAccessor with wrapped data and parameters
    PA-->>P: Provide plotting interface
    P-->>U: Return interactive plot

Possibly related issues

Add Support for Polars Dataframes #55: The PR introduces support for polars DataFrames in the plot function, addressing the need for a unified interface as described in the issue.

Poem

I'm a rabbit, hopping with glee,
New plots and wrappers set my code free.
Pandas and polars now side by side,
Quick Start carrots guide the ride.
In fields of data, I dance and play—
Celebrating changes in a joyful way!
🐇✨

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)

We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
- To enable this feature, set early_access to true under in the settings.

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (13)

requirements.txt (1)

220-221: Consider maintaining alphabetical order in requirements.txt

For better readability and maintainability, consider placing the polars dependency in alphabetical order or regenerating the file using pip-compile to maintain consistency.

The polars dependency could be placed between lines 126 and 127, following alphabetical ordering with the other packages.
README.md (1)
86-89: Consider a more professional phrasing for contributions

The language "feel free to" is commonly used but could be replaced with a more professional alternative.
-Contributions are welcome! Please feel free to submit a Pull Request.
+Contributions are welcome! We encourage you to submit a Pull Request.
🧰 Tools

🪛 LanguageTool

[style] ~88-~88: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ...ontributing Contributions are welcome! Please feel free to submit a Pull Request. ## License Thi...

(FEEL_FREE_TO_STYLE_ME)
pyopenms_viz/_dataframe.py (6)
3-3: Remove unused Optional type import.

The Optional type is imported but never used in this file.
-from typing import Any, Union, List, Optional
+from typing import Any, Union, List
🧰 Tools

🪛 Ruff (0.8.2)

3-3: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

43-48: Simplify the get_column method implementation.

Both branches of the conditional perform the same operation (converting the column to a numpy array). This can be simplified.
def get_column(self, col: str) -> np.ndarray:
    """Get a column as numpy array."""
-    if self._is_polars:
-        return self._data[col].to_numpy()
-    return self._data[col].to_numpy()
+    return self._data[col].to_numpy()
75-80: Simplify the max method implementation.

Both branches of the conditional perform the same operation. This can be simplified.
def max(self, col: str) -> float:
    """Get maximum value of a column."""
-    if self._is_polars:
-        return self._data[col].max()
-    return self._data[col].max()
+    return self._data[col].max()
81-86: Simplify the min method implementation.

Both branches of the conditional perform the same operation. This can be simplified.
def min(self, col: str) -> float:
    """Get minimum value of a column."""
-    if self._is_polars:
-        return self._data[col].min()
-    return self._data[col].min()
+    return self._data[col].min()
93-93: Rename unused loop variable.

The loop variable idx is not used within the loop body. By convention, unused variables should be prefixed with an underscore.
-        for idx, row in self._data.iterrows():
+        for _idx, row in self._data.iterrows():
🧰 Tools

🪛 Ruff (0.8.2)

93-93: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)

1-141: Consider adding more comprehensive DataFrame operations.

The current implementation covers basic DataFrame operations, but you might need to add more methods as the library evolves. Consider implementing additional common operations like:

Filtering (e.g., df[df['column'] > value])

Joining/merging DataFrames

Selection of multiple columns

DataFrame shape property

Row/column addition functions

Would you like me to suggest implementations for any of these additional methods?

🧰 Tools

🪛 Ruff (0.8.2)

3-3: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

93-93: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)

132-132: Function definition does not bind loop variable agg_func

(B023)
pyopenms_viz/_bokeh/core.py (1)
5-5: Remove or utilize unused imports.

Static analysis indicates that Iterator, Any, Dict, List, Optional, Union, np, and DataFrameWrapper are imported but never used. Consider removing them to keep the code clean.
-from typing import Tuple, Iterator, Any, Dict, List, Optional, Union
+from typing import Tuple
 ...
-import numpy as np
 ...
-from .._dataframe import DataFrameWrapper
Also applies to: 24-24, 39-39

🧰 Tools

🪛 Ruff (0.8.2)

5-5: typing.Iterator imported but unused

Remove unused import

(F401)

5-5: typing.Any imported but unused

Remove unused import

(F401)

5-5: typing.Dict imported but unused

Remove unused import

(F401)

5-5: typing.List imported but unused

Remove unused import

(F401)

5-5: typing.Optional imported but unused

Remove unused import

(F401)

5-5: typing.Union imported but unused

Remove unused import

(F401)
pyopenms_viz/_matplotlib/core.py (1)
4-4: Avoid importing unused symbols.

Imports for Any, Dict, List, Optional, Union, re, Line2D, Figure, np, DataFrame, Axes3D, and DataFrameWrapper seem unused per static analysis. Remove them if not needed.
-from typing import Tuple, Any, Dict, List, Optional, Union
+from typing import Tuple
 ...
-import re
 ...
-from matplotlib.lines import Line2D
-from matplotlib.figure import Figure
 ...
-import numpy as np
-from pandas import DataFrame
 ...
-from .._dataframe import DataFrameWrapper
Also applies to: 12-12, 13-13, 31-31

🧰 Tools

🪛 Ruff (0.8.2)

4-4: typing.Any imported but unused

Remove unused import

(F401)

4-4: typing.Dict imported but unused

Remove unused import

(F401)

4-4: typing.List imported but unused

Remove unused import

(F401)

4-4: typing.Optional imported but unused

Remove unused import

(F401)

4-4: typing.Union imported but unused

Remove unused import

(F401)
pyopenms_viz/_core.py (3)
34-35: Remove or utilize these unused imports.

MarkerShapeGenerator, is_latex_formatted, PEAK_BOUNDARY_ICON, FEATURE_BOUNDARY_ICON, DataFrameWrapper are flagged as unused.
-from ._misc import (
-    MarkerShapeGenerator,
-    is_latex_formatted,
-)
 ...
-from .constants import IS_SPHINX_BUILD, IS_NOTEBOOK, PEAK_BOUNDARY_ICON, FEATURE_BOUNDARY_ICON
 ...
-from ._dataframe import DataFrameType, DataFrameWrapper, wrap_dataframe
+from ._dataframe import DataFrameType, wrap_dataframe
Also applies to: 37-37, 39-39

🧰 Tools

🪛 Ruff (0.8.2)

34-34: ._misc.MarkerShapeGenerator imported but unused

Remove unused import

(F401)

35-35: ._misc.is_latex_formatted imported but unused

Remove unused import

(F401)

1003-1003: Simplify conditional expression.

Use a direct Boolean expression:
-ascending=True if data.get_column(y).min() < 0 else False
+ascending=data.get_column(y).min() < 0
🧰 Tools

🪛 Ruff (0.8.2)

1003-1003: Remove unnecessary True if ... else False

Remove unnecessary True if ... else False

(SIM210)

1133-1133: Avoid comparing boolean with True.

As recommended by your static analysis, remove explicit equality:
-if self.bin_peaks == True or (
+if self.bin_peaks or (
🧰 Tools

🪛 Ruff (0.8.2)

1133-1133: Avoid equality comparisons to True; use if self.bin_peaks: for truth checks

Replace with self.bin_peaks

(E712)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40549e8 and 06cfaec.

📒 Files selected for processing (8)

README.md (2 hunks)
pyopenms_viz/__init__.py (2 hunks)
pyopenms_viz/_bokeh/core.py (5 hunks)
pyopenms_viz/_core.py (23 hunks)
pyopenms_viz/_dataframe.py (1 hunks)
pyopenms_viz/_matplotlib/core.py (3 hunks)
pyopenms_viz/_plotly/core.py (2 hunks)
requirements.txt (1 hunks)

🧰 Additional context used

🧬 Code Definitions (3)

pyopenms_viz/_bokeh/core.py (1)

pyopenms_viz/_dataframe.py (3) (3)

DataFrameWrapper (10:105)

data (21:23)

to_pandas (25:29)

pyopenms_viz/_matplotlib/core.py (1)

pyopenms_viz/_dataframe.py (3) (3)

DataFrameWrapper (10:105)

data (21:23)

groupby (56:60)

pyopenms_viz/_core.py (1)

pyopenms_viz/_dataframe.py (9) (9)

DataFrameWrapper (10:105)

wrap_dataframe (139:141)

data (21:23)

set_column (49:54)

get_column (43:47)

groupby (56:60)

max (75:79)

min (81:85)

columns (97:101)

🪛 LanguageTool

README.md

[style] ~88-~88: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ...ontributing Contributions are welcome! Please feel free to submit a Pull Request. ## License Thi...

(FEEL_FREE_TO_STYLE_ME)

🪛 Ruff (0.8.2)

pyopenms_viz/_plotly/core.py

5-5: typing.List imported but unused

Remove unused import

(F401)

5-5: typing.Union imported but unused

Remove unused import

(F401)

5-5: typing.Any imported but unused

Remove unused import

(F401)

5-5: typing.Dict imported but unused

Remove unused import

(F401)

5-5: typing.Optional imported but unused

Remove unused import

(F401)

31-31: .._dataframe.DataFrameWrapper imported but unused

Remove unused import: .._dataframe.DataFrameWrapper

(F401)

pyopenms_viz/_dataframe.py

3-3: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

93-93: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)

132-132: Function definition does not bind loop variable agg_func

(B023)

pyopenms_viz/_bokeh/core.py

5-5: typing.Iterator imported but unused

Remove unused import

(F401)

5-5: typing.Any imported but unused

Remove unused import

(F401)

5-5: typing.Dict imported but unused

Remove unused import

(F401)

5-5: typing.List imported but unused

Remove unused import

(F401)

5-5: typing.Optional imported but unused

Remove unused import

(F401)

5-5: typing.Union imported but unused

Remove unused import

(F401)

24-24: numpy imported but unused

Remove unused import: numpy

(F401)

39-39: .._dataframe.DataFrameWrapper imported but unused

Remove unused import: .._dataframe.DataFrameWrapper

(F401)

pyopenms_viz/_matplotlib/core.py

4-4: typing.Any imported but unused

Remove unused import

(F401)

4-4: typing.Dict imported but unused

Remove unused import

(F401)

4-4: typing.List imported but unused

Remove unused import

(F401)

4-4: typing.Optional imported but unused

Remove unused import

(F401)

4-4: typing.Union imported but unused

Remove unused import

(F401)

5-5: re imported but unused

Remove unused import: re

(F401)

8-8: matplotlib.lines.Line2D imported but unused

Remove unused import: matplotlib.lines.Line2D

(F401)

11-11: matplotlib.figure.Figure imported but unused

Remove unused import: matplotlib.figure.Figure

(F401)

12-12: numpy imported but unused

Remove unused import: numpy

(F401)

13-13: pandas.DataFrame imported but unused

Remove unused import: pandas.DataFrame

(F401)

14-14: mpl_toolkits.mplot3d.Axes3D imported but unused

Remove unused import: mpl_toolkits.mplot3d.Axes3D

(F401)

31-31: .._dataframe.DataFrameWrapper imported but unused

Remove unused import: .._dataframe.DataFrameWrapper

(F401)

pyopenms_viz/_core.py

34-34: ._misc.MarkerShapeGenerator imported but unused

Remove unused import

(F401)

35-35: ._misc.is_latex_formatted imported but unused

Remove unused import

(F401)

37-37: .constants.PEAK_BOUNDARY_ICON imported but unused

Remove unused import

(F401)

37-37: .constants.FEATURE_BOUNDARY_ICON imported but unused

Remove unused import

(F401)

39-39: ._dataframe.DataFrameWrapper imported but unused

Remove unused import: ._dataframe.DataFrameWrapper

(F401)

1003-1003: Remove unnecessary True if ... else False

Remove unnecessary True if ... else False

(SIM210)

1133-1133: Avoid equality comparisons to True; use if self.bin_peaks: for truth checks

Replace with self.bin_peaks

(E712)

🔇 Additional comments (18)

requirements.txt (1)

220-220:
✅ Verification successful

Polars dependency added to meet feature requirements.

The addition of polars>=0.20.7 is appropriate for enabling the new Polars DataFrame support functionality.

Let's verify the status of the specified polars version:

🏁 Script executed:
#!/bin/bash
# Check current polars version on PyPI and latest release date
LATEST=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.info.version')
RELEASE_DATE=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.releases["'$LATEST'"][0].upload_time')

echo "Requested version: >=0.20.7"
echo "Latest version on PyPI: $LATEST"
echo "Latest release date: $RELEASE_DATE"

# Check if 0.20.7 is a valid version
VERSION_EXISTS=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.releases | has("0.20.7")')
echo "Version 0.20.7 exists: $VERSION_EXISTS"

# Check minimum Python version required
MIN_PYTHON=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.info.requires_python')
echo "Minimum Python required: $MIN_PYTHON"
Length of output: 773
Polars Dependency Verification Passed

The verification confirms that the dependency polars>=0.20.7 is valid:

Version 0.20.7 exists on PyPI.

The latest available version is 1.25.2.

The minimum required Python version is >=3.9.

The addition of this dependency is appropriate for enabling the new Polars DataFrame support functionality.
README.md (2)

23-25: Clear feature announcements enhance user understanding.

These new bullet points effectively communicate the key enhancements of supporting both pandas and polars DataFrames, along with interactive plot capabilities and customization options.

64-85: Excellent Quick Start examples for both pandas and polars.

The added Quick Start section provides clear, concise examples demonstrating how to use both pandas and polars DataFrames with the new plot function. This will help users quickly understand how to leverage the new functionality.

pyopenms_viz/_dataframe.py (3)

8-8: Good type alias for DataFrame support.

Creating a DataFrameType type alias for the union of pandas and polars DataFrames enhances code readability and maintainability.

10-42: Well-structured DataFrameWrapper class with clear method implementations.

The DataFrameWrapper class provides a unified interface for handling both pandas and polars DataFrames, with clear methods for conversion, copying, and property access.

139-141: Well-defined utility function for wrapping DataFrames.

The wrap_dataframe function provides a clean interface for creating a DataFrameWrapper instance from either a pandas or polars DataFrame.

pyopenms_viz/__init__.py (3)

5-9: Good job introducing Polars imports.

These new imports for typing, pandas, and polars set the stage for dual DataFrame support. Just ensure your documentation clarifies that Polars is now an additional dependency.

11-11: Confirm usage of PlotAccessor.

You’ve added an import for PlotAccessor from pandas plotting. Verify that you truly intend to rely on Pandas’ internal PlotAccessor. If this is only for type references or if you rely on your custom logic, consider removing it to avoid confusion.

205-224: Introducing a unified plot function looks great.

This function clearly allows passing either a Pandas or Polars DataFrame. The interface is flexible, and the docstring is well-documented. Consider adding a compatibility check or fallback in case Polars is not installed.

pyopenms_viz/_bokeh/core.py (2)

299-299: Using to_pandas() on line data.

Converting your wrapper object to a Pandas DataFrame is valid for Bokeh's ColumnDataSource. Ensure that any polars DataFrame is also properly converted to Pandas at this step, which it is, thanks to the wrapper’s to_pandas method.

310-310: Likewise converting grouped data with to_pandas().

This usage is consistent for each grouped subset. Looks good.

pyopenms_viz/_matplotlib/core.py (1)

314-314: Group-by ordering logic.

The removal of sort=True is fine to preserve the data’s natural order. If stable grouping is required, document that or reintroduce an explicit sorting step.

pyopenms_viz/_core.py (6)

133-133: Constructor now wraps any DataFrame.

Using wrap_dataframe() is a clean approach for unified Pandas/Polars handling. Good move.

151-151: Casting grouping column to string.

Converting self.by column data to string helps avoid issues with non-string grouping keys. Looks sensible.

183-183: Integer column index check.

Well done verifying the user’s integer-based column references. This prevents out-of-range or numeric type confusion.

235-238: Aggregation logic for duplicate data.

This handling is aligned with your aggregate_duplicates flag. The approach is clear and warns the user if duplicates remain.

299-299: Duplicate data check in reference spectrum.

Ensuring your reference spectrum also respects aggregate_duplicates is a consistent approach.

304-308: Aggregation for reference spectrum.

Grouping by known columns, summing, and resetting the index for your reference data is consistent with the main data’s logic.

coderabbitai · 2025-03-18T12:17:22Z

pyopenms_viz/_plotly/core.py

@@ -2,7 +2,7 @@

 from abc import ABC

-from typing import List, Tuple, Union
+from typing import List, Tuple, Union, Any, Dict, Optional


⚠️ Potential issue

Imported type hints are not used in this file.

The added type hints Any, Dict, and Optional are imported but not used in this file, which could potentially cause linting issues.

-from typing import List, Tuple, Union, Any, Dict, Optional +from typing import List, Tuple, Union

Only add these imports if they'll be used in upcoming changes to this file.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

from typing import List, Tuple, Union, Any, Dict, Optional

from typing import List, Tuple, Union

🧰 Tools

🪛 Ruff (0.8.2)

5-5: typing.List imported but unused

Remove unused import

(F401)

5-5: typing.Union imported but unused

Remove unused import

(F401)

5-5: typing.Any imported but unused

Remove unused import

(F401)

5-5: typing.Dict imported but unused

Remove unused import

(F401)

5-5: typing.Optional imported but unused

Remove unused import

(F401)

coderabbitai · 2025-03-18T12:17:22Z

pyopenms_viz/_plotly/core.py

@@ -28,6 +28,7 @@
 from .._config import bokeh_line_dash_mapper
 from .._misc import ColorGenerator, MarkerShapeGenerator, is_latex_formatted
 from ..constants import PEAK_BOUNDARY_ICON, FEATURE_BOUNDARY_ICON
+from .._dataframe import DataFrameWrapper


⚠️ Potential issue

Imported DataFrameWrapper is not used in this file.

The DataFrameWrapper class is imported but not used in this file, which could cause linting issues.

-from .._dataframe import DataFrameWrapper

Only add this import if it will be used in this file. If the intent is to prepare for future changes where the DataFrame handling will be updated to use the wrapper, consider adding a comment explaining this.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

from .._dataframe import DataFrameWrapper

🧰 Tools

🪛 Ruff (0.8.2)

31-31: .._dataframe.DataFrameWrapper imported but unused

Remove unused import: .._dataframe.DataFrameWrapper

(F401)

coderabbitai · 2025-03-18T12:17:22Z

pyopenms_viz/_dataframe.py

+    def agg(self, func: dict) -> DataFrameWrapper:
+        """Aggregate using the specified functions."""
+        if self._is_polars:
+            agg_exprs = []
+            for col, agg_func in func.items():
+                if isinstance(agg_func, str):
+                    agg_exprs.append(pl.col(col).agg(agg_func))
+                else:
+                    agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy())))
+            result = self._grouped.agg(agg_exprs)
+            return DataFrameWrapper(result)
+        else:
+            return DataFrameWrapper(self._grouped.agg(func))
+


⚠️ Potential issue

Fix potential closure issue in agg method.

The agg_func variable is used inside a lambda function within a loop, which could lead to unexpected behavior due to late binding.

def agg(self, func: dict) -> DataFrameWrapper: """Aggregate using the specified functions.""" if self._is_polars: agg_exprs = [] for col, agg_func in func.items(): if isinstance(agg_func, str): agg_exprs.append(pl.col(col).agg(agg_func)) else: - agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy()))) + # Create a closure to capture the current value of agg_func + agg_exprs.append(pl.col(col).agg(lambda x, func=agg_func: func(x.to_numpy()))) result = self._grouped.agg(agg_exprs) return DataFrameWrapper(result) else: return DataFrameWrapper(self._grouped.agg(func))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def agg(self, func: dict) -> DataFrameWrapper:

"""Aggregate using the specified functions."""

if self._is_polars:

agg_exprs = []

for col, agg_func in func.items():

if isinstance(agg_func, str):

agg_exprs.append(pl.col(col).agg(agg_func))

else:

agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy())))

result = self._grouped.agg(agg_exprs)

return DataFrameWrapper(result)

else:

return DataFrameWrapper(self._grouped.agg(func))

def agg(self, func: dict) -> DataFrameWrapper:

"""Aggregate using the specified functions."""

if self._is_polars:

agg_exprs = []

for col, agg_func in func.items():

if isinstance(agg_func, str):

agg_exprs.append(pl.col(col).agg(agg_func))

else:

# Create a closure to capture the current value of agg_func

agg_exprs.append(pl.col(col).agg(lambda x, func=agg_func: func(x.to_numpy())))

result = self._grouped.agg(agg_exprs)

return DataFrameWrapper(result)

else:

return DataFrameWrapper(self._grouped.agg(func))

🧰 Tools

🪛 Ruff (0.8.2)

132-132: Function definition does not bind loop variable agg_func

(B023)

coderabbitai · 2025-03-18T12:17:22Z

pyopenms_viz/_dataframe.py

+    def iterrows(self):
+        """Iterate over DataFrame rows."""
+        if self._is_polars:
+            for row in self._data.iter_rows(named=True):
+                yield row
+        else:
+            for idx, row in self._data.iterrows():
+                yield row
+


⚠️ Potential issue

Fix inconsistency in iterrows behavior.

The pandas iterrows() method returns a tuple of (index, row), but the implementation here returns only the row for both pandas and polars DataFrames. This inconsistency could lead to unexpected behavior.

def iterrows(self): """Iterate over DataFrame rows.""" if self._is_polars: for row in self._data.iter_rows(named=True): - yield row + yield None, row # Return (None, row) to match pandas pattern else: for idx, row in self._data.iterrows(): - yield row + yield idx, row # Return (idx, row) as pandas does

Alternatively, if you want to standardize on returning just the row, update the docstring to clarify this departure from the pandas behavior.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def iterrows(self):

"""Iterate over DataFrame rows."""

if self._is_polars:

for row in self._data.iter_rows(named=True):

yield row

else:

for idx, row in self._data.iterrows():

yield row

def iterrows(self):

"""Iterate over DataFrame rows."""

if self._is_polars:

for row in self._data.iter_rows(named=True):

yield None, row # Return (None, row) to match pandas pattern

else:

for idx, row in self._data.iterrows():

yield idx, row # Return (idx, row) as pandas does

🧰 Tools

🪛 Ruff (0.8.2)

93-93: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)

singjc

This is a good first step. This seems to implement a similar approach as PR 49. However, I'm still not sure if it's the best approach.

Now we will require two dependencies for dataframes, pandas and polars. So a user will have to install both of these, but maybe they only want one or the other?
Under the hood, for certain operations (i.e. bokeh source data), the data gets explicitly converted to a pandas df.
Because there is the to_pandas method, I'm not sure if we need the other methods in _dataframe.py, we could just directly convert the polars df to pandas df in the init plot method, then we wouldn't need to worry about the downstream stuff using pandas manipulations?
We still would like the option to perform method chaining. For your example in the README, it seems like we need to call plot(df, ...).

My only real concern/issue is the need for having to install two dependencies for pandas and polars. Do you think it would be possible to implement it in a way where polars + pyopenms_viz would not require pandas at all? Could we refactor/re-write some of the underlying plot methods that use pandas manipulations to make them agnostic to the type of dataframe?

@jcharkow, what are you thoughts?

singjc · 2025-03-24T14:18:46Z

README.md

+df = pd.DataFrame({
+    'mz': [100, 200, 300],
+    'intensity': [1000, 2000, 3000]
+})
+plot(df, x='mz', y='intensity', kind='spectrum')
+
+# Using polars DataFrame
+df_pl = pl.DataFrame({
+    'mz': [100, 200, 300],
+    'intensity': [1000, 2000, 3000]
+})
+plot(df_pl, x='mz', y='intensity', kind='spectrum')


Does the user have to explicitly call the plot method? The nice aspect of pyopenms_viz is that it was designed to be compatible with pandas plot method, so that you can perform object/method chaining df.plot(...). This example doesn't really show that. Can the same be done for the polars df example?

singjc · 2025-03-24T14:25:49Z

pyopenms_viz/_dataframe.py

+
+DataFrameType = Union[pd.DataFrame, pl.DataFrame]
+
+class DataFrameWrapper:


See the initial PR 49 suggestion:

One minor thing: maybe consider using names that match to common design pattern terminology.
Using DataFrameInterface and PandasDataFrameAdapter instead of e.g. wrapper

singjc · 2025-03-24T14:27:59Z

pyopenms_viz/_dataframe.py

+    def get_column(self, col: str) -> np.ndarray:
+        """Get a column as numpy array."""
+        if self._is_polars:
+            return self._data[col].to_numpy()
+        return self._data[col].to_numpy()
+
+    def set_column(self, col: str, value: Any) -> None:
+        """Set a column value."""
+        if self._is_polars:
+            self._data = self._data.with_columns(pl.Series(col, value))
+        else:
+            self._data[col] = value
+
+    def groupby(self, by: str) -> 'GroupByWrapper':
+        """Group the DataFrame by a column."""
+        if self._is_polars:
+            return GroupByWrapper(self._data.groupby(by), is_polars=True)
+        return GroupByWrapper(self._data.groupby(by), is_polars=False)
+
+    def fillna(self, value: Any) -> 'DataFrameWrapper':
+        """Fill NA/null values."""
+        if self._is_polars:
+            return DataFrameWrapper(self._data.fill_null(value))
+        return DataFrameWrapper(self._data.fillna(value))
+
+    def between(self, col: str, left: float, right: float) -> 'DataFrameWrapper':
+        """Select rows where column values are between left and right."""
+        if self._is_polars:
+            mask = (self._data[col] >= left) & (self._data[col] <= right)
+            return DataFrameWrapper(self._data.filter(mask))
+        return DataFrameWrapper(self._data[self._data[col].between(left, right)])
+
+    def max(self, col: str) -> float:
+        """Get maximum value of a column."""
+        if self._is_polars:
+            return self._data[col].max()
+        return self._data[col].max()
+
+    def min(self, col: str) -> float:
+        """Get minimum value of a column."""
+        if self._is_polars:
+            return self._data[col].min()
+        return self._data[col].min()
+
+    def iterrows(self):
+        """Iterate over DataFrame rows."""
+        if self._is_polars:
+            for row in self._data.iter_rows(named=True):
+                yield row
+        else:
+            for idx, row in self._data.iterrows():
+                yield row
+
+    @property
+    def columns(self) -> List[str]:
+        """Get column names."""
+        if self._is_polars:
+            return self._data.columns
+        return list(self._data.columns)
+
+    def __getitem__(self, key: str) -> np.ndarray:
+        """Get a column by name."""
+        return self.get_column(key)
+
+
+class GroupByWrapper:
+    """Wrapper for grouped DataFrame operations."""
+
+    def __init__(self, grouped, is_polars: bool):
+        self._grouped = grouped
+        self._is_polars = is_polars
+
+    def __iter__(self):
+        """Iterate over groups."""
+        if self._is_polars:
+            for name, group in self._grouped.groups():
+                yield name, DataFrameWrapper(group)
+        else:
+            for name, group in self._grouped:
+                yield name, DataFrameWrapper(group)
+
+    def agg(self, func: dict) -> DataFrameWrapper:
+        """Aggregate using the specified functions."""
+        if self._is_polars:
+            agg_exprs = []
+            for col, agg_func in func.items():
+                if isinstance(agg_func, str):
+                    agg_exprs.append(pl.col(col).agg(agg_func))
+                else:
+                    agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy())))
+            result = self._grouped.agg(agg_exprs)
+            return DataFrameWrapper(result)
+        else:
+            return DataFrameWrapper(self._grouped.agg(func))
+
+
+def wrap_dataframe(data: DataFrameType) -> DataFrameWrapper:
+    """Create a DataFrameWrapper instance from a pandas or polars DataFrame."""
+    return DataFrameWrapper(data) 


Since there is a to_pandas method, I am wondering if these other methods are needed? Would it just be better to just convert the data to pandas upfront in the plotting init method, then nothing else downstream would require changing?

feat: Add Polars DataFrame support

06cfaec

coderabbitai bot reviewed Mar 18, 2025

View reviewed changes

jcharkow requested a review from singjc March 23, 2025 17:13

singjc reviewed Mar 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Add Polars DataFrame support #71

feat: Add Polars DataFrame support #71

Uh oh!

SanjayUG commented Mar 18, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 18, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 18, 2025

Uh oh!

coderabbitai bot Mar 18, 2025

Uh oh!

coderabbitai bot Mar 18, 2025

Uh oh!

coderabbitai bot Mar 18, 2025

Uh oh!

singjc left a comment

Uh oh!

singjc Mar 24, 2025

Uh oh!

singjc Mar 24, 2025

Uh oh!

singjc Mar 24, 2025

Uh oh!

Uh oh!

	from typing import List, Tuple, Union, Any, Dict, Optional
	from typing import List, Tuple, Union


		DataFrameType = Union[pd.DataFrame, pl.DataFrame]

		class DataFrameWrapper:

feat: Add Polars DataFrame support #71

Are you sure you want to change the base?

feat: Add Polars DataFrame support #71

Uh oh!

Conversation

SanjayUG commented Mar 18, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request description