
Conversation

@RohanDisa

This PR adds a new column n_obs_aggregated to the output of sc.get.aggregate, reporting the total number of observations aggregated into each group. This helps users track how many cells or replicates contributed to each aggregated row.

Notes / Questions:

  • I have implemented n_obs_aggregated (point 1).
  • Regarding point 2 about n_{by[i]}_aggregated — could you clarify with an example what these columns should contain? When grouping by ["patient", "cell_type"], wouldn't n_patient_aggregated and n_cell_type_aggregated always be 1?

@ilan-gold
Contributor

Wow @RohanDisa thanks for taking this on!

> Regarding point 2 about n_{by[i]}_aggregated — could you clarify with an example what these columns should contain? When grouping by ["patient", "cell_type"], wouldn't n_patient_aggregated and n_cell_type_aggregated always be 1?

Hmmm, yes, the way I wrote that is not helpful at all. It should have been more like n_obs_per_patient_aggregated and n_obs_per_cell_type_aggregated: if you have 4 cells aggregated in a patient-cell type group, but there are 10 cells in the patient (across cell types) and 12 in the cell type (across patients), then the 4 would be n_obs_aggregated, the 12 would be n_obs_per_cell_type_aggregated, and the 10 n_obs_per_patient_aggregated.

The idea would be to give a way to get the fraction of cells aggregated in a given cell-patient combination. Perhaps this is actually a better metric.

So I think fraction_of_cells and fraction_of_patients might be more helpful metrics here, i.e., .33 and .4 in the above example.
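To make the arithmetic concrete, here is a toy sketch of those fractions (the data, column names, and fraction_of_* names are illustrative only, not scanpy API):

```python
import pandas as pd

# Toy data matching the example above: patient p1 has 10 cells (4 of
# cell type T), and cell type T has 12 cells overall (4 from p1, 8 from p2).
cells = pd.DataFrame({
    "patient":   ["p1"] * 10 + ["p2"] * 8,
    "cell_type": ["T"] * 4 + ["B"] * 6 + ["T"] * 8,
})

# Per patient-cell_type group size, i.e. n_obs_aggregated
grp = cells.groupby(["patient", "cell_type"]).size().rename("n_obs_aggregated").reset_index()

# Divide each group size by the patient's total and the cell type's total
grp["fraction_of_patients"] = grp["n_obs_aggregated"] / grp["patient"].map(
    cells["patient"].value_counts()
)
grp["fraction_of_cells"] = grp["n_obs_aggregated"] / grp["cell_type"].map(
    cells["cell_type"].value_counts()
)

row = grp.set_index(["patient", "cell_type"]).loc[("p1", "T")]
# For the (p1, T) group: n_obs_aggregated = 4, fraction_of_patients = 4/10,
# fraction_of_cells = 4/12
```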

However, I think it's fine to leave this out actually and only add n_obs_aggregated.

# Counts should be positive
assert (result.obs["n_obs_aggregated"] > 0).all()
# Total counts should equal original n_obs
assert result.obs["n_obs_aggregated"].sum() == pbmc_adata.n_obs

I think this is wrong; you'd want the per-louvain group counts.
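A sketch of the per-group check being suggested, with toy stand-ins for pbmc_adata.obs and the aggregated result (names and data are illustrative):

```python
import pandas as pd

# Toy stand-ins: 5 cells across 2 louvain groups, and the obs of an
# aggregated result reporting n_obs_aggregated per group.
cells_obs = pd.DataFrame({"louvain": ["0", "0", "0", "1", "1"]})
result_obs = pd.DataFrame(
    {"n_obs_aggregated": [3, 2]}, index=pd.Index(["0", "1"], name="louvain")
)

# Compare group by group, not just the overall sum: two wrong counts
# could still sum to the right total.
expected = cells_obs["louvain"].value_counts()
assert (result_obs["n_obs_aggregated"].reindex(expected.index) == expected).all()
```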

)
assert "n_obs_aggregated" in result.obs
# Still sums back to the total number of obs
assert result.obs["n_obs_aggregated"].sum() == pbmc_adata.n_obs

Same on the count check

Comment on lines 563 to 565
result = sc.get.aggregate(
pbmc_adata, by=["louvain", "percent_mito_binned"], func="mean"
)

It doesn't make sense to aggregate percent_mito_binned because it is not a categorical
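For context, grouping keys in aggregate need to be categorical; a binned column only qualifies if it was created as one, e.g. via pd.cut (a toy sketch, not the test's actual setup):

```python
import pandas as pd

# Hypothetical: bin a continuous percent_mito column into a categorical,
# which is the dtype grouping requires.
percent_mito = pd.Series([0.01, 0.05, 0.12, 0.30])
percent_mito_binned = pd.cut(percent_mito, bins=[0.0, 0.1, 0.2, 1.0])
```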

# Check column exists
assert "n_obs_aggregated" in result.obs
# Counts should be positive
assert (result.obs["n_obs_aggregated"] > 0).all()

No need to check this given the below check

Comment on lines 577 to 579
assert "n_obs_aggregated" in result.obs
# Only groups with data should appear
assert set(result.obs["fake_group"]) == {"A", "B"}

No need to check these

# Only groups with data should appear
assert set(result.obs["fake_group"]) == {"A", "B"}
# Count check
assert result.obs["n_obs_aggregated"].sum() == pbmc_adata.n_obs

Same on the count check

RohanDisa and others added 2 commits October 1, 2025 21:39
Use hardcoded expected values instead of duplicating implementation
logic for more reliable test verification.
@RohanDisa
Author

Hey @ilan-gold, right now I've written tests using synthetic toy AnnData objects where the group counts are known exactly. This makes it easy to validate n_obs_aggregated.

Do you think I should also add a test using the small PBMC test dataset (pbmc3k_parametrized_small) to ensure things behave correctly on a real dataset, or are the current toy-data tests sufficient?

@flying-sheep flying-sheep changed the title Add n_obs_aggregated column to aggregated AnnData output feat: add n_obs_aggregated column to aggregated AnnData output Oct 2, 2025
@ilan-gold
Contributor

> Hey @ilan-gold, right now I've written tests using synthetic toy AnnData objects where the group counts are known exactly. This makes it easy to validate n_obs_aggregated.

I think that's fine actually.

@ilan-gold
Contributor

@RohanDisa sorry about the unhelpful test failures previously, but now you should be getting helpful ones :) They look simple enough.

@ilan-gold ilan-gold added this to the 1.12.0 milestone Oct 2, 2025
@RohanDisa
Author

Hey @ilan-gold,

I wanted to provide some context about the CI failures in the aggregation tests. The issue is caused by the recent addition of the n_obs_aggregated column in aggregate():

group_sizes = pd.Series(categorical).value_counts().reindex(new_label_df.index)
new_label_df["n_obs_aggregated"] = group_sizes.values

This adds an extra column to .obs, which is why existing aggregation tests are failing with shape mismatches, e.g.:

[left]: (3, 1) [right]: (3, 2)

The last two tests I added specifically check for n_obs_aggregated and pass, but older tests were written before this column existed and expect the previous .obs shape.
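To illustrate the mechanism of the snippet above with a standalone toy (variable names mirror the PR, but this is a sketch, not scanpy's actual code):

```python
import pandas as pd

# Toy grouping vector and the de-duplicated label frame it would produce
categorical = pd.Categorical(["B", "A", "A", "B", "A"])
new_label_df = pd.DataFrame(index=pd.Index(["A", "B"], name="group"))

# value_counts() yields per-group sizes; reindex() aligns them with the
# row order of the aggregated obs DataFrame before assignment.
group_sizes = pd.Series(categorical).value_counts().reindex(new_label_df.index)
new_label_df["n_obs_aggregated"] = group_sizes.values
```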

Suggested solution:

To avoid breaking existing tests while keeping n_obs_aggregated functionality:

  • Make the addition of n_obs_aggregated optional via a flag, e.g., add_n_obs_aggregated: bool = False in aggregate().
  • Only add the column when this flag is True.
  • Update my two new tests to call aggregate(add_n_obs_aggregated=True).
  • Existing tests will continue to pass without changes.

This approach ensures backward compatibility and keeps the CI green.

Happy to implement this if you agree with the approach.
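A minimal sketch of how such a flag could gate the column (a hypothetical, heavily simplified stand-in for the obs-building step of sc.get.aggregate; the real function has a very different signature):

```python
import pandas as pd

def build_obs(
    categorical: pd.Categorical, *, add_n_obs_aggregated: bool = False
) -> pd.DataFrame:
    """Build the aggregated obs frame; add the count column only on request."""
    obs = pd.DataFrame(index=pd.Index(categorical.categories, name="group"))
    if add_n_obs_aggregated:
        obs["n_obs_aggregated"] = (
            pd.Series(categorical).value_counts().reindex(obs.index).values
        )
    return obs

cats = pd.Categorical(["A", "B", "A"])
default_obs = build_obs(cats)                            # old shape preserved
opt_in_obs = build_obs(cats, add_n_obs_aggregated=True)  # new column present
```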

Just to clarify the other CI failures:

  • Test step failing: this seems to be caused by environment setup or dependency issues in the Hatch environment. My PR does not touch test setup beyond aggregate(), so I don't think this is caused by my changes.
  • Benchmark workflow failing (igraph missing): the benchmark jobs fail because igraph is not installed. My PR doesn't use igraph, so this failure is unrelated.

@ilan-gold
Contributor

@flying-sheep is the addition of a column to an output obs pandas.DataFrame a breaking change?

@RohanDisa
Author

Hi @ilan-gold, just wanted to check in to see if there’s been any update or if there’s anything I can help with to move this forward. Thanks!

@ilan-gold
Contributor

@RohanDisa Still waiting for @flying-sheep to respond, but in any case, the tests in https://github.com/scverse/scanpy/actions/runs/18199982179/job/51816237544?pr=3824 are failing, so could you fix those? Even if we put this behind a flag, the tests will still be necessary.

@RohanDisa
Author

Hey @ilan-gold,
So, I updated the code. The only two remaining issues are:

  1. The benchmark job encountered an ImportError because the igraph package is missing.
  2. The ReadTheDocs build failed because of multiple unresolved Sphinx references to typing.Union. With fail_on_warning: true enabled, these warnings are treated as errors, causing the documentation build to fail even though the docs compile successfully otherwise.
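For reference, the Read the Docs setting in question looks roughly like this (a hypothetical excerpt of a .readthedocs.yaml; the repo's actual file may differ):

```yaml
sphinx:
  configuration: docs/conf.py
  # With this enabled, any Sphinx warning (such as an unresolved
  # reference to typing.Union) fails the documentation build.
  fail_on_warning: true
```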

Do let me know if any changes are needed on my side.

@RohanDisa
Author

Hey @ilan-gold, @flying-sheep,
Just checking in to see if there’s anything else needed on my end.
Would love to get this merged if it looks okay to you guys!

@ilan-gold
Contributor

@RohanDisa I'd still like to hear from @flying-sheep on whether this should go in a minor version or not, i.e., does outputting a new column in obs constitute a "feature" that would go into a minor instead of a patch (like adding a new kwarg or something)?

I will look at the unrelated failures (hopefully) soon

@flying-sheep
Member

I think it’s not a breaking change, but it is a feature and should therefore not go into a patch version. So I think the current milestone is correct!


Development

Successfully merging this pull request may close these issues.

Output number of observations aggregated from scanpy.get.aggregate

3 participants