Skip to content

Conversation

SanjayUG
Copy link

@SanjayUG SanjayUG commented Mar 18, 2025

Pull Request description

feat: Add Polars DataFrame support

  • Create DataFrame abstraction layer for pandas/polars compatibility
  • Update plotting backends to handle both DataFrame types
  • Add polars dependency and documentation
  • Maintain existing API and functionality

Which issue this PR aims to resolve or fix?

Solve #55

How to test these changes

  • Ensure you have both pandas and polars installed
  • Run existing tests to ensure compatibility with both DataFrame types
  • Verify that plotting functions work with both pandas and polars DataFrames

Pull Request checklists

This PR is a:

  • bug-fix
  • new feature
  • maintenance

About this PR:

  • it includes tests.
  • the tests are executed on CI.
  • the tests generate log file(s) (path).
  • pre-commit hooks were executed locally.
  • this PR requires a project documentation update.

Author's checklist:

  • I have reviewed the changes and it contains no misspelling.
  • The code is well commented, especially in the parts that contain more
    complexity.
  • New and old tests passed locally.

Additional information

Reviewer's checklist

Copy and paste this template for your review's note:

## Reviewer's Checklist

- [ ] I managed to reproduce the problem locally from the `main` branch
- [ ] I managed to test the new changes locally
- [ ] I confirm that the issues mentioned were fixed/resolved.

Summary by CodeRabbit

  • New Features

    • Expanded support for both pandas and polars DataFrames, offering greater flexibility for data visualization.
    • Enhanced interactive plotting with intuitive zoom, pan, and hover functionalities, plus improved customizable styling and annotations.
  • Documentation

    • Introduced a new "Quick Start" guide featuring example code to streamline the onboarding process for new users.

Copy link

coderabbitai bot commented Mar 18, 2025

Walkthrough

This pull request enhances the pyOpenMS-Viz library by adding support for both pandas and polars DataFrames. The documentation is updated with a new "Quick Start" section and example code, while a new plot function is introduced to accept either DataFrame type. Internal modifications improve data handling by refactoring column accesses through dedicated methods and wrapping DataFrames using a unified interface. Additionally, the changes update type hints and imports in various visualization backends and add a new dependency on the polars library.

Changes

File(s) Change Summary
README.md - Added a new "Quick Start" section with code examples showcasing pandas and polars usage.
- Updated documentation to highlight interactive plot features and customizable styling.
pyopenms_viz/init.py - Introduced a new plot function supporting both pandas and polars DataFrames.
- Enhanced the function with detailed docstrings and flexible argument handling through PlotAccessor.
pyopenms_viz/_bokeh/core.py, pyopenms_viz/_matplotlib/core.py, pyopenms_viz/_plotly/core.py - Updated import statements to include extended type hints and new libraries (e.g., numpy, DataFrameWrapper).
- In Bokeh, data is now converted with .to_pandas() for ColumnDataSource compatibility; Matplotlib’s groupby call has been adjusted.
pyopenms_viz/_core.py - Modified the BasePlot class to accept a union of DataFrame types using wrap_dataframe.
- Replaced direct column access with get_column and set_column methods.
- Added new imports for enhanced visualization handling.
pyopenms_viz/_dataframe.py - Added a new file providing the DataFrameWrapper and GroupByWrapper classes, along with the wrap_dataframe function, offering a unified interface for both pandas and polars DataFrames.
requirements.txt - Added a dependency: polars>=0.20.7.

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant P as pyopenms_viz.plot
    participant DF as DataFrameWrapper
    participant PA as PlotAccessor

    U->>P: Call plot(data, *args, **kwargs)
    P->>DF: wrap_dataframe(data)
    DF-->>P: Return wrapped DataFrame
    P->>PA: Initialize PlotAccessor with wrapped data and parameters
    PA-->>P: Provide plotting interface
    P-->>U: Return interactive plot
Loading

Possibly related issues

Poem

I'm a rabbit, hopping with glee,
New plots and wrappers set my code free.
Pandas and polars now side by side,
Quick Start carrots guide the ride.
In fields of data, I dance and play—
Celebrating changes in a joyful way!
🐇✨

Tip

⚡🧪 Multi-step agentic review comment chat (experimental)
  • We're introducing multi-step agentic chat in review comments. This experimental feature enhances review discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments.
    - To enable this feature, set early_access to true under in the settings.
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (13)
requirements.txt (1)

220-221: Consider maintaining alphabetical order in requirements.txt

For better readability and maintainability, consider placing the polars dependency in alphabetical order or regenerating the file using pip-compile to maintain consistency.

The polars dependency could be placed between lines 126 and 127, following alphabetical ordering with the other packages.

README.md (1)

86-89: Consider a more professional phrasing for contributions

The language "feel free to" is commonly used but could be replaced with a more professional alternative.

-Contributions are welcome! Please feel free to submit a Pull Request.
+Contributions are welcome! We encourage you to submit a Pull Request.
🧰 Tools
🪛 LanguageTool

[style] ~88-~88: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ...ontributing Contributions are welcome! Please feel free to submit a Pull Request. ## License Thi...

(FEEL_FREE_TO_STYLE_ME)

pyopenms_viz/_dataframe.py (6)

3-3: Remove unused Optional type import.

The Optional type is imported but never used in this file.

-from typing import Any, Union, List, Optional
+from typing import Any, Union, List
🧰 Tools
🪛 Ruff (0.8.2)

3-3: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)


43-48: Simplify the get_column method implementation.

Both branches of the conditional perform the same operation (converting the column to a numpy array). This can be simplified.

def get_column(self, col: str) -> np.ndarray:
    """Get a column as numpy array."""
-    if self._is_polars:
-        return self._data[col].to_numpy()
-    return self._data[col].to_numpy()
+    return self._data[col].to_numpy()

75-80: Simplify the max method implementation.

Both branches of the conditional perform the same operation. This can be simplified.

def max(self, col: str) -> float:
    """Get maximum value of a column."""
-    if self._is_polars:
-        return self._data[col].max()
-    return self._data[col].max()
+    return self._data[col].max()

81-86: Simplify the min method implementation.

Both branches of the conditional perform the same operation. This can be simplified.

def min(self, col: str) -> float:
    """Get minimum value of a column."""
-    if self._is_polars:
-        return self._data[col].min()
-    return self._data[col].min()
+    return self._data[col].min()

93-93: Rename unused loop variable.

The loop variable idx is not used within the loop body. By convention, unused variables should be prefixed with an underscore.

-        for idx, row in self._data.iterrows():
+        for _idx, row in self._data.iterrows():
🧰 Tools
🪛 Ruff (0.8.2)

93-93: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


1-141: Consider adding more comprehensive DataFrame operations.

The current implementation covers basic DataFrame operations, but you might need to add more methods as the library evolves. Consider implementing additional common operations like:

  1. Filtering (e.g., df[df['column'] > value])
  2. Joining/merging DataFrames
  3. Selection of multiple columns
  4. DataFrame shape property
  5. Row/column addition functions

Would you like me to suggest implementations for any of these additional methods?

🧰 Tools
🪛 Ruff (0.8.2)

3-3: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)


93-93: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


132-132: Function definition does not bind loop variable agg_func

(B023)

pyopenms_viz/_bokeh/core.py (1)

5-5: Remove or utilize unused imports.

Static analysis indicates that Iterator, Any, Dict, List, Optional, Union, np, and DataFrameWrapper are imported but never used. Consider removing them to keep the code clean.

-from typing import Tuple, Iterator, Any, Dict, List, Optional, Union
+from typing import Tuple
 ...
-import numpy as np
 ...
-from .._dataframe import DataFrameWrapper

Also applies to: 24-24, 39-39

🧰 Tools
🪛 Ruff (0.8.2)

5-5: typing.Iterator imported but unused

Remove unused import

(F401)


5-5: typing.Any imported but unused

Remove unused import

(F401)


5-5: typing.Dict imported but unused

Remove unused import

(F401)


5-5: typing.List imported but unused

Remove unused import

(F401)


5-5: typing.Optional imported but unused

Remove unused import

(F401)


5-5: typing.Union imported but unused

Remove unused import

(F401)

pyopenms_viz/_matplotlib/core.py (1)

4-4: Avoid importing unused symbols.

Imports for Any, Dict, List, Optional, Union, re, Line2D, Figure, np, DataFrame, Axes3D, and DataFrameWrapper seem unused per static analysis. Remove them if not needed.

-from typing import Tuple, Any, Dict, List, Optional, Union
+from typing import Tuple
 ...
-import re
 ...
-from matplotlib.lines import Line2D
-from matplotlib.figure import Figure
 ...
-import numpy as np
-from pandas import DataFrame
 ...
-from .._dataframe import DataFrameWrapper

Also applies to: 12-12, 13-13, 31-31

🧰 Tools
🪛 Ruff (0.8.2)

4-4: typing.Any imported but unused

Remove unused import

(F401)


4-4: typing.Dict imported but unused

Remove unused import

(F401)


4-4: typing.List imported but unused

Remove unused import

(F401)


4-4: typing.Optional imported but unused

Remove unused import

(F401)


4-4: typing.Union imported but unused

Remove unused import

(F401)

pyopenms_viz/_core.py (3)

34-35: Remove or utilize these unused imports.

MarkerShapeGenerator, is_latex_formatted, PEAK_BOUNDARY_ICON, FEATURE_BOUNDARY_ICON, DataFrameWrapper are flagged as unused.

-from ._misc import (
-    MarkerShapeGenerator,
-    is_latex_formatted,
-)
 ...
-from .constants import IS_SPHINX_BUILD, IS_NOTEBOOK, PEAK_BOUNDARY_ICON, FEATURE_BOUNDARY_ICON
 ...
-from ._dataframe import DataFrameType, DataFrameWrapper, wrap_dataframe
+from ._dataframe import DataFrameType, wrap_dataframe

Also applies to: 37-37, 39-39

🧰 Tools
🪛 Ruff (0.8.2)

34-34: ._misc.MarkerShapeGenerator imported but unused

Remove unused import

(F401)


35-35: ._misc.is_latex_formatted imported but unused

Remove unused import

(F401)


1003-1003: Simplify conditional expression.

Use a direct Boolean expression:

-ascending=True if data.get_column(y).min() < 0 else False
+ascending=data.get_column(y).min() < 0
🧰 Tools
🪛 Ruff (0.8.2)

1003-1003: Remove unnecessary True if ... else False

Remove unnecessary True if ... else False

(SIM210)


1133-1133: Avoid comparing boolean with True.

As recommended by your static analysis, remove explicit equality:

-if self.bin_peaks == True or (
+if self.bin_peaks or (
🧰 Tools
🪛 Ruff (0.8.2)

1133-1133: Avoid equality comparisons to True; use if self.bin_peaks: for truth checks

Replace with self.bin_peaks

(E712)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40549e8 and 06cfaec.

📒 Files selected for processing (8)
  • README.md (2 hunks)
  • pyopenms_viz/__init__.py (2 hunks)
  • pyopenms_viz/_bokeh/core.py (5 hunks)
  • pyopenms_viz/_core.py (23 hunks)
  • pyopenms_viz/_dataframe.py (1 hunks)
  • pyopenms_viz/_matplotlib/core.py (3 hunks)
  • pyopenms_viz/_plotly/core.py (2 hunks)
  • requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code Definitions (3)
pyopenms_viz/_bokeh/core.py (1)
pyopenms_viz/_dataframe.py (3) (3)
  • DataFrameWrapper (10:105)
  • data (21:23)
  • to_pandas (25:29)
pyopenms_viz/_matplotlib/core.py (1)
pyopenms_viz/_dataframe.py (3) (3)
  • DataFrameWrapper (10:105)
  • data (21:23)
  • groupby (56:60)
pyopenms_viz/_core.py (1)
pyopenms_viz/_dataframe.py (9) (9)
  • DataFrameWrapper (10:105)
  • wrap_dataframe (139:141)
  • data (21:23)
  • set_column (49:54)
  • get_column (43:47)
  • groupby (56:60)
  • max (75:79)
  • min (81:85)
  • columns (97:101)
🪛 LanguageTool
README.md

[style] ~88-~88: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ...ontributing Contributions are welcome! Please feel free to submit a Pull Request. ## License Thi...

(FEEL_FREE_TO_STYLE_ME)

🪛 Ruff (0.8.2)
pyopenms_viz/_plotly/core.py

5-5: typing.List imported but unused

Remove unused import

(F401)


5-5: typing.Union imported but unused

Remove unused import

(F401)


5-5: typing.Any imported but unused

Remove unused import

(F401)


5-5: typing.Dict imported but unused

Remove unused import

(F401)


5-5: typing.Optional imported but unused

Remove unused import

(F401)


31-31: .._dataframe.DataFrameWrapper imported but unused

Remove unused import: .._dataframe.DataFrameWrapper

(F401)

pyopenms_viz/_dataframe.py

3-3: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)


93-93: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


132-132: Function definition does not bind loop variable agg_func

(B023)

pyopenms_viz/_bokeh/core.py

5-5: typing.Iterator imported but unused

Remove unused import

(F401)


5-5: typing.Any imported but unused

Remove unused import

(F401)


5-5: typing.Dict imported but unused

Remove unused import

(F401)


5-5: typing.List imported but unused

Remove unused import

(F401)


5-5: typing.Optional imported but unused

Remove unused import

(F401)


5-5: typing.Union imported but unused

Remove unused import

(F401)


24-24: numpy imported but unused

Remove unused import: numpy

(F401)


39-39: .._dataframe.DataFrameWrapper imported but unused

Remove unused import: .._dataframe.DataFrameWrapper

(F401)

pyopenms_viz/_matplotlib/core.py

4-4: typing.Any imported but unused

Remove unused import

(F401)


4-4: typing.Dict imported but unused

Remove unused import

(F401)


4-4: typing.List imported but unused

Remove unused import

(F401)


4-4: typing.Optional imported but unused

Remove unused import

(F401)


4-4: typing.Union imported but unused

Remove unused import

(F401)


5-5: re imported but unused

Remove unused import: re

(F401)


8-8: matplotlib.lines.Line2D imported but unused

Remove unused import: matplotlib.lines.Line2D

(F401)


11-11: matplotlib.figure.Figure imported but unused

Remove unused import: matplotlib.figure.Figure

(F401)


12-12: numpy imported but unused

Remove unused import: numpy

(F401)


13-13: pandas.DataFrame imported but unused

Remove unused import: pandas.DataFrame

(F401)


14-14: mpl_toolkits.mplot3d.Axes3D imported but unused

Remove unused import: mpl_toolkits.mplot3d.Axes3D

(F401)


31-31: .._dataframe.DataFrameWrapper imported but unused

Remove unused import: .._dataframe.DataFrameWrapper

(F401)

pyopenms_viz/_core.py

34-34: ._misc.MarkerShapeGenerator imported but unused

Remove unused import

(F401)


35-35: ._misc.is_latex_formatted imported but unused

Remove unused import

(F401)


37-37: .constants.PEAK_BOUNDARY_ICON imported but unused

Remove unused import

(F401)


37-37: .constants.FEATURE_BOUNDARY_ICON imported but unused

Remove unused import

(F401)


39-39: ._dataframe.DataFrameWrapper imported but unused

Remove unused import: ._dataframe.DataFrameWrapper

(F401)


1003-1003: Remove unnecessary True if ... else False

Remove unnecessary True if ... else False

(SIM210)


1133-1133: Avoid equality comparisons to True; use if self.bin_peaks: for truth checks

Replace with self.bin_peaks

(E712)

🔇 Additional comments (18)
requirements.txt (1)

220-220:

✅ Verification successful

Polars dependency added to meet feature requirements.

The addition of polars>=0.20.7 is appropriate for enabling the new Polars DataFrame support functionality.

Let's verify the status of the specified polars version:


🏁 Script executed:

#!/bin/bash
# Check current polars version on PyPI and latest release date
LATEST=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.info.version')
RELEASE_DATE=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.releases["'$LATEST'"][0].upload_time')

echo "Requested version: >=0.20.7"
echo "Latest version on PyPI: $LATEST"
echo "Latest release date: $RELEASE_DATE"

# Check if 0.20.7 is a valid version
VERSION_EXISTS=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.releases | has("0.20.7")')
echo "Version 0.20.7 exists: $VERSION_EXISTS"

# Check minimum Python version required
MIN_PYTHON=$(curl -s https://pypi.org/pypi/polars/json | jq -r '.info.requires_python')
echo "Minimum Python required: $MIN_PYTHON"

Length of output: 773


Polars Dependency Verification Passed

The verification confirms that the dependency polars>=0.20.7 is valid:

  • Version 0.20.7 exists on PyPI.
  • The latest available version is 1.25.2.
  • The minimum required Python version is >=3.9.

The addition of this dependency is appropriate for enabling the new Polars DataFrame support functionality.

README.md (2)

23-25: Clear feature announcements enhance user understanding.

These new bullet points effectively communicate the key enhancements of supporting both pandas and polars DataFrames, along with interactive plot capabilities and customization options.


64-85: Excellent Quick Start examples for both pandas and polars.

The added Quick Start section provides clear, concise examples demonstrating how to use both pandas and polars DataFrames with the new plot function. This will help users quickly understand how to leverage the new functionality.

pyopenms_viz/_dataframe.py (3)

8-8: Good type alias for DataFrame support.

Creating a DataFrameType type alias for the union of pandas and polars DataFrames enhances code readability and maintainability.


10-42: Well-structured DataFrameWrapper class with clear method implementations.

The DataFrameWrapper class provides a unified interface for handling both pandas and polars DataFrames, with clear methods for conversion, copying, and property access.


139-141: Well-defined utility function for wrapping DataFrames.

The wrap_dataframe function provides a clean interface for creating a DataFrameWrapper instance from either a pandas or polars DataFrame.

pyopenms_viz/__init__.py (3)

5-9: Good job introducing Polars imports.

These new imports for typing, pandas, and polars set the stage for dual DataFrame support. Just ensure your documentation clarifies that Polars is now an additional dependency.


11-11: Confirm usage of PlotAccessor.

You’ve added an import for PlotAccessor from pandas plotting. Verify that you truly intend to rely on Pandas’ internal PlotAccessor. If this is only for type references or if you rely on your custom logic, consider removing it to avoid confusion.


205-224: Introducing a unified plot function looks great.

This function clearly allows passing either a Pandas or Polars DataFrame. The interface is flexible, and the docstring is well-documented. Consider adding a compatibility check or fallback in case Polars is not installed.

pyopenms_viz/_bokeh/core.py (2)

299-299: Using to_pandas() on line data.

Converting your wrapper object to a Pandas DataFrame is valid for Bokeh's ColumnDataSource. Ensure that any polars DataFrame is also properly converted to Pandas at this step, which it is, thanks to the wrapper’s to_pandas method.


310-310: Likewise converting grouped data with to_pandas().

This usage is consistent for each grouped subset. Looks good.

pyopenms_viz/_matplotlib/core.py (1)

314-314: Group-by ordering logic.

The removal of sort=True is fine to preserve the data’s natural order. If stable grouping is required, document that or reintroduce an explicit sorting step.

pyopenms_viz/_core.py (6)

133-133: Constructor now wraps any DataFrame.

Using wrap_dataframe() is a clean approach for unified Pandas/Polars handling. Good move.


151-151: Casting grouping column to string.

Converting self.by column data to string helps avoid issues with non-string grouping keys. Looks sensible.


183-183: Integer column index check.

Well done verifying the user’s integer-based column references. This prevents out-of-range or numeric type confusion.


235-238: Aggregation logic for duplicate data.

This handling is aligned with your aggregate_duplicates flag. The approach is clear and warns the user if duplicates remain.


299-299: Duplicate data check in reference spectrum.

Ensuring your reference spectrum also respects aggregate_duplicates is a consistent approach.


304-308: Aggregation for reference spectrum.

Grouping by known columns, summing, and resetting the index for your reference data is consistent with the main data’s logic.

@@ -2,7 +2,7 @@

from abc import ABC

from typing import List, Tuple, Union
from typing import List, Tuple, Union, Any, Dict, Optional
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Imported type hints are not used in this file.

The added type hints Any, Dict, and Optional are imported but not used in this file, which could potentially cause linting issues.

-from typing import List, Tuple, Union, Any, Dict, Optional
+from typing import List, Tuple, Union

Only add these imports if they'll be used in upcoming changes to this file.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from typing import List, Tuple, Union, Any, Dict, Optional
from typing import List, Tuple, Union
🧰 Tools
🪛 Ruff (0.8.2)

5-5: typing.List imported but unused

Remove unused import

(F401)


5-5: typing.Union imported but unused

Remove unused import

(F401)


5-5: typing.Any imported but unused

Remove unused import

(F401)


5-5: typing.Dict imported but unused

Remove unused import

(F401)


5-5: typing.Optional imported but unused

Remove unused import

(F401)

@@ -28,6 +28,7 @@
from .._config import bokeh_line_dash_mapper
from .._misc import ColorGenerator, MarkerShapeGenerator, is_latex_formatted
from ..constants import PEAK_BOUNDARY_ICON, FEATURE_BOUNDARY_ICON
from .._dataframe import DataFrameWrapper
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Imported DataFrameWrapper is not used in this file.

The DataFrameWrapper class is imported but not used in this file, which could cause linting issues.

-from .._dataframe import DataFrameWrapper

Only add this import if it will be used in this file. If the intent is to prepare for future changes where the DataFrame handling will be updated to use the wrapper, consider adding a comment explaining this.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from .._dataframe import DataFrameWrapper
🧰 Tools
🪛 Ruff (0.8.2)

31-31: .._dataframe.DataFrameWrapper imported but unused

Remove unused import: .._dataframe.DataFrameWrapper

(F401)

Comment on lines +124 to +137
def agg(self, func: dict) -> DataFrameWrapper:
"""Aggregate using the specified functions."""
if self._is_polars:
agg_exprs = []
for col, agg_func in func.items():
if isinstance(agg_func, str):
agg_exprs.append(pl.col(col).agg(agg_func))
else:
agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy())))
result = self._grouped.agg(agg_exprs)
return DataFrameWrapper(result)
else:
return DataFrameWrapper(self._grouped.agg(func))

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix potential closure issue in agg method.

The agg_func variable is used inside a lambda function within a loop, which could lead to unexpected behavior due to late binding.

def agg(self, func: dict) -> DataFrameWrapper:
    """Aggregate using the specified functions."""
    if self._is_polars:
        agg_exprs = []
        for col, agg_func in func.items():
            if isinstance(agg_func, str):
                agg_exprs.append(pl.col(col).agg(agg_func))
            else:
-                agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy())))
+                # Create a closure to capture the current value of agg_func
+                agg_exprs.append(pl.col(col).agg(lambda x, func=agg_func: func(x.to_numpy())))
        result = self._grouped.agg(agg_exprs)
        return DataFrameWrapper(result)
    else:
        return DataFrameWrapper(self._grouped.agg(func))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def agg(self, func: dict) -> DataFrameWrapper:
"""Aggregate using the specified functions."""
if self._is_polars:
agg_exprs = []
for col, agg_func in func.items():
if isinstance(agg_func, str):
agg_exprs.append(pl.col(col).agg(agg_func))
else:
agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy())))
result = self._grouped.agg(agg_exprs)
return DataFrameWrapper(result)
else:
return DataFrameWrapper(self._grouped.agg(func))
def agg(self, func: dict) -> DataFrameWrapper:
"""Aggregate using the specified functions."""
if self._is_polars:
agg_exprs = []
for col, agg_func in func.items():
if isinstance(agg_func, str):
agg_exprs.append(pl.col(col).agg(agg_func))
else:
# Create a closure to capture the current value of agg_func
agg_exprs.append(pl.col(col).agg(lambda x, func=agg_func: func(x.to_numpy())))
result = self._grouped.agg(agg_exprs)
return DataFrameWrapper(result)
else:
return DataFrameWrapper(self._grouped.agg(func))
🧰 Tools
🪛 Ruff (0.8.2)

132-132: Function definition does not bind loop variable agg_func

(B023)

Comment on lines +87 to +95
def iterrows(self):
"""Iterate over DataFrame rows."""
if self._is_polars:
for row in self._data.iter_rows(named=True):
yield row
else:
for idx, row in self._data.iterrows():
yield row

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix inconsistency in iterrows behavior.

The pandas iterrows() method returns a tuple of (index, row), but the implementation here returns only the row for both pandas and polars DataFrames. This inconsistency could lead to unexpected behavior.

def iterrows(self):
    """Iterate over DataFrame rows."""
    if self._is_polars:
        for row in self._data.iter_rows(named=True):
-            yield row
+            yield None, row  # Return (None, row) to match pandas pattern
    else:
        for idx, row in self._data.iterrows():
-            yield row
+            yield idx, row  # Return (idx, row) as pandas does

Alternatively, if you want to standardize on returning just the row, update the docstring to clarify this departure from the pandas behavior.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def iterrows(self):
"""Iterate over DataFrame rows."""
if self._is_polars:
for row in self._data.iter_rows(named=True):
yield row
else:
for idx, row in self._data.iterrows():
yield row
def iterrows(self):
"""Iterate over DataFrame rows."""
if self._is_polars:
for row in self._data.iter_rows(named=True):
yield None, row # Return (None, row) to match pandas pattern
else:
for idx, row in self._data.iterrows():
yield idx, row # Return (idx, row) as pandas does
🧰 Tools
🪛 Ruff (0.8.2)

93-93: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)

@jcharkow jcharkow requested a review from singjc March 23, 2025 17:13
Copy link
Collaborator

@singjc singjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good first step. This seems to implement a similar approach as PR 49. However, I'm still not sure if it's the best approach.

  1. Now we will require two dependencies for dataframes, pandas and polars. So a user will have to install both of these, but maybe they only want one or the other?
  2. Under the hood, for certain operations (i.e. bokeh source data), the data gets explicitly converted to a pandas df.
  3. Because there is the to_pandas method, I'm not sure if we need the other methods in _dataframe.py, we could just directly convert the polars df to pandas df in the init plot method, then we wouldn't need to worry about the downstream stuff using pandas manipulations?
  4. We still would like the option to perform method chaining. For your example in the README, it seems like we need to call plot(df, ...).

My only real concern/issue is the need for having to install two dependencies for pandas and polars. Do you think it would be possible to implement it in a way where polars + pyopenms_viz would not require pandas at all? Could we refactor/re-write some of the underlying plot methods that use pandas manipulations to make them agnostic to the type of dataframe?

@jcharkow, what are you thoughts?

Comment on lines +72 to +83
df = pd.DataFrame({
'mz': [100, 200, 300],
'intensity': [1000, 2000, 3000]
})
plot(df, x='mz', y='intensity', kind='spectrum')

# Using polars DataFrame
df_pl = pl.DataFrame({
'mz': [100, 200, 300],
'intensity': [1000, 2000, 3000]
})
plot(df_pl, x='mz', y='intensity', kind='spectrum')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the user have to explicitly call the plot method? The nice aspect of pyopenms_viz is that it was designed to be compatible with pandas plot method, so that you can perform object/method chaining df.plot(...). This example doesn't really show that. Can the same be done for the polars df example?


DataFrameType = Union[pd.DataFrame, pl.DataFrame]

class DataFrameWrapper:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the initial PR 49 suggestion:

One minor thing: maybe consider using names that match to common design pattern terminology.
Using DataFrameInterface and PandasDataFrameAdapter instead of e.g. wrapper

Comment on lines +43 to +141
def get_column(self, col: str) -> np.ndarray:
"""Get a column as numpy array."""
if self._is_polars:
return self._data[col].to_numpy()
return self._data[col].to_numpy()

def set_column(self, col: str, value: Any) -> None:
"""Set a column value."""
if self._is_polars:
self._data = self._data.with_columns(pl.Series(col, value))
else:
self._data[col] = value

def groupby(self, by: str) -> 'GroupByWrapper':
"""Group the DataFrame by a column."""
if self._is_polars:
return GroupByWrapper(self._data.groupby(by), is_polars=True)
return GroupByWrapper(self._data.groupby(by), is_polars=False)

def fillna(self, value: Any) -> 'DataFrameWrapper':
"""Fill NA/null values."""
if self._is_polars:
return DataFrameWrapper(self._data.fill_null(value))
return DataFrameWrapper(self._data.fillna(value))

def between(self, col: str, left: float, right: float) -> 'DataFrameWrapper':
"""Select rows where column values are between left and right."""
if self._is_polars:
mask = (self._data[col] >= left) & (self._data[col] <= right)
return DataFrameWrapper(self._data.filter(mask))
return DataFrameWrapper(self._data[self._data[col].between(left, right)])

def max(self, col: str) -> float:
"""Get maximum value of a column."""
if self._is_polars:
return self._data[col].max()
return self._data[col].max()

def min(self, col: str) -> float:
"""Get minimum value of a column."""
if self._is_polars:
return self._data[col].min()
return self._data[col].min()

def iterrows(self):
"""Iterate over DataFrame rows."""
if self._is_polars:
for row in self._data.iter_rows(named=True):
yield row
else:
for idx, row in self._data.iterrows():
yield row

@property
def columns(self) -> List[str]:
"""Get column names."""
if self._is_polars:
return self._data.columns
return list(self._data.columns)

def __getitem__(self, key: str) -> np.ndarray:
"""Get a column by name."""
return self.get_column(key)


class GroupByWrapper:
"""Wrapper for grouped DataFrame operations."""

def __init__(self, grouped, is_polars: bool):
self._grouped = grouped
self._is_polars = is_polars

def __iter__(self):
"""Iterate over groups."""
if self._is_polars:
for name, group in self._grouped.groups():
yield name, DataFrameWrapper(group)
else:
for name, group in self._grouped:
yield name, DataFrameWrapper(group)

def agg(self, func: dict) -> DataFrameWrapper:
"""Aggregate using the specified functions."""
if self._is_polars:
agg_exprs = []
for col, agg_func in func.items():
if isinstance(agg_func, str):
agg_exprs.append(pl.col(col).agg(agg_func))
else:
agg_exprs.append(pl.col(col).agg(lambda x: agg_func(x.to_numpy())))
result = self._grouped.agg(agg_exprs)
return DataFrameWrapper(result)
else:
return DataFrameWrapper(self._grouped.agg(func))


def wrap_dataframe(data: DataFrameType) -> DataFrameWrapper:
"""Create a DataFrameWrapper instance from a pandas or polars DataFrame."""
return DataFrameWrapper(data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is a to_pandas method, I am wondering if these other methods are needed? Would it just be better to just convert the data to pandas upfront in the plotting init method, then nothing else downstream would require changing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants