@adrianlut
Contributor

Issue #, if available: No issue number available

Description of changes:

Added DataFrame.sample() to the list of supported pandas functions.

I also added it to the README and added a test. This PR requires #38: the test only works correctly with the fix from #38, and it also verifies that the fix from #38 works correctly for unary selections.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov-commenter

Codecov Report

Merging #48 (d9812e6) into master (7d6137d) will decrease coverage by 0.01%.
The diff coverage is 97.29%.

@@            Coverage Diff             @@
##           master      #48      +/-   ##
==========================================
- Coverage   96.22%   96.21%   -0.02%     
==========================================
  Files          33       33              
  Lines        2145     2165      +20     
==========================================
+ Hits         2064     2083      +19     
- Misses         81       82       +1     
| Impacted Files | Coverage Δ |
|---|---|
| mlinspect/monkeypatching/_patch_pandas.py | 96.37% <94.44%> (-0.16%) ⬇️ |
| mlinspect/backends/_iter_creation.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7d6137d...d9812e6.

Owner

@stefan-grafberger left a comment

Thanks for your work; this is great!

I think it's awesome that you're able to add support for a new API function. Of course, there's still a lot of room for improvement to make this process even easier in the future, but I'm really happy that this part of the codebase is now understandable enough for others to make changes. :-)

| `('pandas.core.frame', '__getitem__')`, arg type: strings | Projection|
| `('pandas.core.frame', '__getitem__')`, arg type: series | Selection |
| `('pandas.core.frame', 'dropna')` | Selection |
| `('pandas.core.frame', 'sample')` | Selection |

I think we shouldn't use the selection operator for this but introduce a new one that captures its semantics better. Something along the lines of OperatorType.RESAMPLE, what do you think?

The existing inspections and checks would then also need a minor update to handle this new operator type. We should also add new tests for the inspections and checks, where it makes sense, to verify that the new behavior works. That's mainly NoBiasIntroducedFor and HistogramForColumns. Their tests are in test/inspections/test_histogram_for_columns.py and test/check/test_no_bias_introduced_for.py.

Would you be willing to add that also?

Contributor Author

I think that the selection operator fits the sampling operation. After all, it (randomly) selects a subset of rows from the DataFrame, and the iterator creation ensures that inspections do not have to deal with the row order.

However, I know that you implemented some inspections that assume constant row order and I can therefore understand the idea of creating a new operator type for selections that do not preserve order.

I would also suggest thinking about methods (properties) like loc, iloc, and sort_values that can change the row order without selecting. I don't know if resample is the best word to use.
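
To make the point about order-changing operations concrete, here is a minimal pandas sketch. The toy frame and variable names are made up for illustration and are not taken from the mlinspect code or tests. It shows that `sort_values`, and `iloc` with a permuted position list, return every row of the input, only in a different order.

```python
import pandas as pd

# Hypothetical toy data, chosen only for this illustration.
df = pd.DataFrame({"A": [10, 0, 5, 2]})

# sort_values keeps all rows but reorders them by column A.
sorted_df = df.sort_values("A")

# iloc with a list of positions can also reorder without dropping rows.
permuted_df = df.iloc[[3, 2, 1, 0]]

# Same multiset of rows in both cases ...
same_rows = sorted(sorted_df["A"]) == sorted(df["A"]) == sorted(permuted_df["A"])

# ... but the original index labels now appear in a different order.
order_changed = list(sorted_df.index) != list(df.index)

print(same_rows, order_changed)
```

Neither call removes a row, so an operator type that only describes row filtering would not capture what happens here.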

Contributor Author

I just noticed that the frac option is allowed to be bigger than 1. Then it obviously isn't a selection anymore. So yes, adding a new OperatorType seems to be a good idea. But not for this evening.
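
A short sketch of the `frac` behavior, using the same six-row frame that appears in the test code under review. The `random_state` value and the variable names are arbitrary choices for this example. In recent pandas versions, `frac` greater than 1 is only accepted together with `replace=True`, so the result necessarily contains repeated rows.

```python
import pandas as pd

df = pd.DataFrame([0, 2, 4, 5, 10, 15], columns=["A"])

# frac <= 1 without replacement: a true subset, each row at most once.
subset = df.sample(frac=0.5, random_state=0)

# frac > 1 needs replace=True and returns more rows than the input,
# so some original rows must appear more than once.
upsampled = df.sample(frac=2, replace=True, random_state=0)

print(len(df), len(subset), len(upsampled))  # 6 3 12
print(subset.index.is_unique, upsampled.index.is_unique)  # True False
```

The second call is what makes a pure selection operator a poor fit: the output is larger than the input and contains duplicated index labels.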

Owner

@stefan-grafberger commented on Jun 10, 2021

Thanks for your comment! Yes, upsampling with frac > 1 was the main thing I was concerned about. I guess if loc and iloc are used for selecting rows, then OperatorType.SELECTION would be appropriate. Do you think OperatorType.RESAMPLE would be alright here, or do you have another naming suggestion? In this context, the dataframe algebra presented in this paper might also interest you.

Very good decision to stop working at this time of the day :-) If you decide adding a new OperatorType is too much work or if you have any questions, just let me know. Thanks for addressing all the other code review comments!

DagNodeDetails(None, ['A']),
OptionalCodeInfo(CodeReference(3, 5, 3, 54),
"pd.DataFrame([0, 2, 4, 5, 10, 15], columns=['A'])"))
expected_select = DagNode(1,

This variable name should be updated
