
Conversation

@RohanDisa

This PR adds a new column n_obs_aggregated to the output of sc.get.aggregate, reporting the total number of observations aggregated into each group. This helps users track how many cells or replicates contributed to each aggregated row.

Notes / Questions:

  • I have implemented n_obs_aggregated (point 1).
  • Regarding point 2 about n_{by[i]}_aggregated — could you clarify with an example what these columns should contain? When grouping by ["patient", "cell_type"], wouldn't n_patient_aggregated and n_cell_type_aggregated always be 1?

@ilan-gold
Contributor

Wow @RohanDisa thanks for taking this on!

> Regarding point 2 about n_{by[i]}_aggregated — could you clarify with an example what these columns should contain? When grouping by ["patient", "cell_type"], wouldn't n_patient_aggregated and n_cell_type_aggregated always be 1?

Hmmm, yes, the way I wrote that is not helpful at all. It should have been more like n_obs_per_patient_aggregated and n_obs_per_cell_type_aggregated: if you have 4 cells aggregated in a patient-cell type group, but there are 10 cells in the patient (across cell types) and 12 in the cell type (across patients), then the 4 would be n_obs_aggregated, the 12 would be n_obs_per_cell_type_aggregated, and the 10 n_obs_per_patient_aggregated.

The idea would be to give a way to get the fraction of cells aggregated in a given cell-patient combination. Perhaps this is actually a better metric.

So I think fraction_of_cells and fraction_of_patients might be more helpful metrics here, i.e., .33 and .4 in the above example.
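To make the arithmetic concrete, here is a toy sketch of those fractions (the data, column names, and fraction_of_* names are illustrative only, not scanpy API):

```python
import pandas as pd

# Toy data matching the example above: patient p1 has 10 cells (4 of
# cell type T), and cell type T has 12 cells overall (4 from p1, 8 from p2).
cells = pd.DataFrame({
    "patient":   ["p1"] * 10 + ["p2"] * 8,
    "cell_type": ["T"] * 4 + ["B"] * 6 + ["T"] * 8,
})

# Per patient-cell_type group size, i.e. n_obs_aggregated
grp = cells.groupby(["patient", "cell_type"]).size().rename("n_obs_aggregated").reset_index()

# Divide each group size by the patient's total and the cell type's total
grp["fraction_of_patients"] = grp["n_obs_aggregated"] / grp["patient"].map(
    cells["patient"].value_counts()
)
grp["fraction_of_cells"] = grp["n_obs_aggregated"] / grp["cell_type"].map(
    cells["cell_type"].value_counts()
)

row = grp.set_index(["patient", "cell_type"]).loc[("p1", "T")]
# For the (p1, T) group: n_obs_aggregated = 4, fraction_of_patients = 4/10,
# fraction_of_cells = 4/12
```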

However, I think it's fine to leave this out actually and only add n_obs_aggregated.

# Counts should be positive
assert (result.obs["n_obs_aggregated"] > 0).all()
# Total counts should equal original n_obs
assert result.obs["n_obs_aggregated"].sum() == pbmc_adata.n_obs

I think this is wrong; you'd want the per-louvain group counts.
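A sketch of the per-group check being suggested, with toy stand-ins for pbmc_adata.obs and the aggregated result (names and data are illustrative):

```python
import pandas as pd

# Toy stand-ins: 5 cells across 2 louvain groups, and the obs of an
# aggregated result reporting n_obs_aggregated per group.
cells_obs = pd.DataFrame({"louvain": ["0", "0", "0", "1", "1"]})
result_obs = pd.DataFrame(
    {"n_obs_aggregated": [3, 2]}, index=pd.Index(["0", "1"], name="louvain")
)

# Compare group by group, not just the overall sum: two wrong counts
# could still sum to the right total.
expected = cells_obs["louvain"].value_counts()
assert (result_obs["n_obs_aggregated"].reindex(expected.index) == expected).all()
```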

)
assert "n_obs_aggregated" in result.obs
# Still sums back to the total number of obs
assert result.obs["n_obs_aggregated"].sum() == pbmc_adata.n_obs

Same on the count check

Comment on lines 563 to 565
result = sc.get.aggregate(
pbmc_adata, by=["louvain", "percent_mito_binned"], func="mean"
)

It doesn't make sense to aggregate percent_mito_binned because it is not a categorical
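For context, grouping keys in aggregate need to be categorical; a binned column only qualifies if it was created as one, e.g. via pd.cut (a toy sketch, not the test's actual setup):

```python
import pandas as pd

# Hypothetical: bin a continuous percent_mito column into a categorical,
# which is the dtype grouping requires.
percent_mito = pd.Series([0.01, 0.05, 0.12, 0.30])
percent_mito_binned = pd.cut(percent_mito, bins=[0.0, 0.1, 0.2, 1.0])
```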

# Check column exists
assert "n_obs_aggregated" in result.obs
# Counts should be positive
assert (result.obs["n_obs_aggregated"] > 0).all()

No need to check this given the below check

Comment on lines 577 to 579
assert "n_obs_aggregated" in result.obs
# Only groups with data should appear
assert set(result.obs["fake_group"]) == {"A", "B"}

No need to check these

# Only groups with data should appear
assert set(result.obs["fake_group"]) == {"A", "B"}
# Count check
assert result.obs["n_obs_aggregated"].sum() == pbmc_adata.n_obs

Same on the count check

RohanDisa and others added 2 commits October 1, 2025 21:39
Use hardcoded expected values instead of duplicating implementation
logic for more reliable test verification.
@RohanDisa
Author

Hey @ilan-gold, right now I've written tests using synthetic toy AnnData objects where the group counts are known exactly. This makes it easy to validate n_obs_aggregated.

Do you think I should also add a test using the small PBMC test dataset (pbmc3k_parametrized_small) to ensure things behave correctly on a real dataset, or are the current toy-data tests sufficient?

@flying-sheep flying-sheep changed the title Add n_obs_aggregated column to aggregated AnnData output feat: add n_obs_aggregated column to aggregated AnnData output Oct 2, 2025
@ilan-gold
Contributor

> Hey @ilan-gold, right now I've written tests using synthetic toy AnnData objects where the group counts are known exactly. This makes it easy to validate n_obs_aggregated.

I think that's fine actually.

@ilan-gold
Contributor

@RohanDisa sorry about the unhelpful test failures previously, but now you should be getting helpful ones :) They look simple enough.

@ilan-gold ilan-gold added this to the 1.12.0 milestone Oct 2, 2025
@RohanDisa
Author

Hey @ilan-gold,

I wanted to provide some context about the CI failures in the aggregation tests. The issue is caused by the recent addition of the n_obs_aggregated column in aggregate():

group_sizes = pd.Series(categorical).value_counts().reindex(new_label_df.index)
new_label_df["n_obs_aggregated"] = group_sizes.values

This adds an extra column to .obs, which is why existing aggregation tests are failing with shape mismatches, e.g.:

[left]: (3, 1) [right]: (3, 2)

The last two tests I added specifically check for n_obs_aggregated and pass, but older tests were written before this column existed and expect the previous .obs shape.
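To illustrate the mechanism of the snippet above with a standalone toy (variable names mirror the PR, but this is a sketch, not scanpy's actual code):

```python
import pandas as pd

# Toy grouping vector and the de-duplicated label frame it would produce
categorical = pd.Categorical(["B", "A", "A", "B", "A"])
new_label_df = pd.DataFrame(index=pd.Index(["A", "B"], name="group"))

# value_counts() yields per-group sizes; reindex() aligns them with the
# row order of the aggregated obs DataFrame before assignment.
group_sizes = pd.Series(categorical).value_counts().reindex(new_label_df.index)
new_label_df["n_obs_aggregated"] = group_sizes.values
```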

Suggested solution:

To avoid breaking existing tests while keeping n_obs_aggregated functionality:

  • Make the addition of n_obs_aggregated optional via a flag, e.g., add_n_obs_aggregated: bool = False in aggregate().
  • Only add the column when this flag is True.
  • Update my two new tests to call aggregate(add_n_obs_aggregated=True).
  • Existing tests will continue to pass without changes.

This approach ensures backward compatibility and keeps the CI green.

Happy to implement this if you agree with the approach.
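A minimal sketch of how such a flag could gate the column (a hypothetical, heavily simplified stand-in for the obs-building step of sc.get.aggregate; the real function has a very different signature):

```python
import pandas as pd

def build_obs(
    categorical: pd.Categorical, *, add_n_obs_aggregated: bool = False
) -> pd.DataFrame:
    """Build the aggregated obs frame; add the count column only on request."""
    obs = pd.DataFrame(index=pd.Index(categorical.categories, name="group"))
    if add_n_obs_aggregated:
        obs["n_obs_aggregated"] = (
            pd.Series(categorical).value_counts().reindex(obs.index).values
        )
    return obs

cats = pd.Categorical(["A", "B", "A"])
default_obs = build_obs(cats)                            # old shape preserved
opt_in_obs = build_obs(cats, add_n_obs_aggregated=True)  # new column present
```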

Just to clarify the other CI failures:

  • Test step failing: this seems to be caused by environment setup or dependency issues in the Hatch environment. My PR does not touch test setup beyond aggregate(), so I don't think this is caused by my changes.
  • Benchmark workflow failing (igraph missing): the benchmark jobs fail because igraph is not installed. My PR doesn't use igraph, so this failure is unrelated.

@ilan-gold
Contributor

@flying-sheep is the addition of a column to an output obs pandas.DataFrame a breaking change?

@RohanDisa
Author

Hi @ilan-gold, just wanted to check in to see if there’s been any update or if there’s anything I can help with to move this forward. Thanks!

@ilan-gold
Contributor

@RohanDisa Still waiting for @flying-sheep to respond, but in any case, the tests in https://github.com/scverse/scanpy/actions/runs/18199982179/job/51816237544?pr=3824 are failing, so could you fix those? Even if we put this behind a flag, the tests will still be necessary.

@RohanDisa
Author

Hey @ilan-gold,
So, I updated the code. The only two remaining issues are:

  1. The benchmark job encountered an ImportError because the igraph package is missing.
  2. The ReadTheDocs build failed because of multiple unresolved Sphinx references to typing.Union. With fail_on_warning: true enabled, these warnings are treated as errors, causing the documentation build to fail even though the docs compile successfully otherwise.
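For reference, the Read the Docs setting in question looks roughly like this (a hypothetical excerpt of a .readthedocs.yaml; the repo's actual file may differ):

```yaml
sphinx:
  configuration: docs/conf.py
  # With this enabled, any Sphinx warning (such as an unresolved
  # reference to typing.Union) fails the documentation build.
  fail_on_warning: true
```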

Do let me know if any changes are needed on my side.

@RohanDisa
Author

Hey @ilan-gold, @flying-sheep,
Just checking in to see if there’s anything else needed on my end.
Would love to get this merged if it looks okay to you guys!

@ilan-gold
Contributor

@RohanDisa I'd still like to hear from @flying-sheep on whether this should go in a minor version or not, i.e., does outputting a new column in obs constitute a "feature" that would go into a minor instead of a patch (like adding a new kwarg or something)?

I will look at the unrelated failures (hopefully) soon

@flying-sheep
Member

I think it’s not a breaking change, but it is a feature and should therefore not go into a patch version. So I think the current milestone is correct!


Development

Successfully merging this pull request may close these issues.

Output number of observations aggregated from scanpy.get.aggregate

3 participants