- 
                Notifications
    You must be signed in to change notification settings 
- Fork 674
feat: add n_obs_aggregated column to aggregated AnnData output #3824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
| Wow @RohanDisa thanks for taking this on! 
 Hmmm yes the way I wrote that is not helpful at all.  It should have been more like  The idea would be to give a way to get the fraction of cells aggregated in a given cell-patient combination. Perhaps this is actually a better metric. So I think then  However, I think it's fine to leave this out actually and only add  | 
        
          
                tests/test_aggregated.py
              
                Outdated
          
        
      | # Counts should be positive | ||
| assert (result.obs["n_obs_aggregated"] > 0).all() | ||
| # Total counts should equal original n_obs | ||
| assert result.obs["n_obs_aggregated"].sum() == pbmc_adata.n_obs | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is wrong, you'd want the per-louvain group counts
        
          
                tests/test_aggregated.py
              
                Outdated
          
        
      | ) | ||
| assert "n_obs_aggregated" in result.obs | ||
| # Still sums back to the total number of obs | ||
| assert result.obs["n_obs_aggregated"].sum() == pbmc_adata.n_obs | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same on the count check
        
          
                tests/test_aggregated.py
              
                Outdated
          
        
      | result = sc.get.aggregate( | ||
| pbmc_adata, by=["louvain", "percent_mito_binned"], func="mean" | ||
| ) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't make sense to aggregate percent_mito_binned because it is not a categorical
        
          
                tests/test_aggregated.py
              
                Outdated
          
        
      | # Check column exists | ||
| assert "n_obs_aggregated" in result.obs | ||
| # Counts should be positive | ||
| assert (result.obs["n_obs_aggregated"] > 0).all() | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to check this given the below check
        
          
                tests/test_aggregated.py
              
                Outdated
          
        
      | assert "n_obs_aggregated" in result.obs | ||
| # Only groups with data should appear | ||
| assert set(result.obs["fake_group"]) == {"A", "B"} | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to check these
        
          
                tests/test_aggregated.py
              
                Outdated
          
        
      | # Only groups with data should appear | ||
| assert set(result.obs["fake_group"]) == {"A", "B"} | ||
| # Count check | ||
| assert result.obs["n_obs_aggregated"].sum() == pbmc_adata.n_obs | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same on the count check
Use hardcoded expected values instead of duplicating implementation logic for more reliable test verification.
for more information, see https://pre-commit.ci
| Hey @ilan-gold, Right now I’ve written tests using synthetic toy AnnData objects where the group counts are known exactly. This makes it easy to validate n_obs_aggregated. Do you think I should also add a test using the small PBMC test dataset (pbmc3k_parametrized_small) to ensure things behave correctly on a real dataset, or are the current toy-data tests sufficient? | 
| 
 I think that's fine actually. | 
| @RohanDisa sorry about the unhelpful test failures previously but now you should be getting helpful ones :) they look simple enough | 
| Hey @ilan-gold, I wanted to provide some context about the CI failures in the aggregation tests. The issue is caused by the recent addition of the n_obs_aggregated column in aggregate(): group_sizes = pd.Series(categorical).value_counts().reindex(new_label_df.index) This adds an extra column to .obs, which is why existing aggregation tests are failing with shape mismatches, e.g.: [left]: (3, 1) [right]: (3, 2) The last two tests I added specifically check for n_obs_aggregated and pass, but older tests were written before this column existed and expect the previous .obs shape. Suggested solution: To avoid breaking existing tests while keeping n_obs_aggregated functionality: Make the addition of n_obs_aggregated optional via a flag, e.g., add_n_obs_aggregated: bool = False in aggregate(). Only add the column when this flag is True. Update my two new tests to call aggregate(add_n_obs_aggregated=True). Existing tests will continue to pass without changes. This approach ensures backward compatibility and keeps the CI green. Happy to implement this if you agree with the approach. Just to clarify the other CI failures: Test step failing – This seems to be caused by environment setup or dependency issues in the Hatch environment. My PR does not touch test setup beyond aggregate(), so I don’t think this is caused by my changes. Benchmark workflow failing (igraph missing) – The benchmark jobs fail because igraph is not installed. My PR doesn’t use igraph, so this failure is unrelated. | 
| @flying-sheep is the addition of a column to an output  | 
| Hi @ilan-gold, just wanted to check in to see if there’s been any update or if there’s anything I can help with to move this forward. Thanks! | 
| @RohanDisa Still waiting for @flying-sheep to respond but in any case, the tests in https://github.com/scverse/scanpy/actions/runs/18199982179/job/51816237544?pr=3824 are failing so could you fix those? Even if we put this behind a flag, the tests will still be necessary | 
| Hey @ilan-gold, 
 Do let me know if any changes are needed on my side. | 
| Hey @ilan-gold, @flying-sheep, | 
| @RohanDisa I'd still like to hear from @flying-sheep on whether this should go in a minor version or not, but if we don't hear, i.e., does outputting a new column in  I will look at the unrelated failures (hopefully) soon | 
| I think it’s not a breaking change, but it is a feature and should therefore not go into a patch version. So I think the current milestone is correct! | 
This PR adds a new column
n_obs_aggregatedto the output ofsc.get.aggregate, reporting the total number of observations aggregated into each group. This helps users track how many cells or replicates contributed to each aggregated row.scanpy.get.aggregate#3822hatch run towncrier:create)Notes / Questions:
n_obs_aggregated(point 1).n_{by[i]}_aggregated— could you clarify with an example what these columns should contain? When grouping by["patient", "cell_type"], wouldn'tn_patient_aggregatedandn_cell_type_aggregatedalways be 1?