Add flexible `aggregate` analysis layer #696

jkrumbiegel · 2025-10-30T21:38:47Z

This PR adds the `aggregate` analysis, which aggregates one or more mapped columns grouped by one or more other columns. The aggregation functions can be chosen freely, and the plotting function is picked as usual with `visual`, so it's a very flexible analysis layer. The motivation is that it's usually a bit annoying to add data wrangling just for simple visualizations that should be done on the fly, where you are not interested in keeping the aggregated data around. This way you don't have to come up with a variable name for the aggregated data, and it works with all table inputs, not just the typical `DataFrame`.

For example, let's say we have some categories and associated measurements. We can plot these as a normal scatter:

using AlgebraOfGraphics
using CairoMakie
using Statistics

df = (
    cats = repeat(["low", "mid", "high"], 30),
    vals = repeat([1, 2, 3], 30) .+ randn.(),
)

base = data(df) * mapping(:cats, :vals)
scat = base * visual(Scatter)
draw(scat)

Let's say we want to show the median of each group. We can do this with `aggregate`. Every mapped column needs to be either a grouping column or an aggregated column. Grouping columns are denoted by a `:`.

med = base * aggregate(:, median) * visual(Scatter, markersize = 20, color = :red)
draw(scat + med)
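
To make the grouping concrete, here is a minimal sketch in plain Julia of the kind of computation described above: split the aggregated column by the distinct values of the grouping column, then reduce each group with the chosen function. The helper `group_reduce` is a hypothetical name used only for illustration. It is not part of AlgebraOfGraphics, and the actual implementation in this PR may work differently.

```julia
using Statistics

# Hypothetical helper (not part of AlgebraOfGraphics): reduce `vals` with `f`
# separately for each distinct entry of `groups`.
function group_reduce(f, groups, vals)
    labels = unique(groups)  # distinct groups in order of first appearance
    reduced = [f(vals[groups .== l]) for l in labels]
    return labels, reduced
end

cats = repeat(["low", "mid", "high"], 3)
vals = [1.0, 2.0, 3.0, 1.5, 2.5, 3.5, 0.5, 1.5, 2.5]

# One median per category, analogous to one red marker per group in the plot.
labels, medians = group_reduce(median, cats, vals)
```

With this toy input, `labels` is `["low", "mid", "high"]` and `medians` is `[1.0, 2.0, 3.0]`.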

Each column can only have one aggregation function applied, but this function may return multiple values per group, for example as a tuple. Multiple functions can then be applied to that result, each of which can be assigned to a different output mapping. This can be used, for example, to draw error bars or confidence intervals. Let's compute the 25th and 75th percentiles and draw the interval between them.

interval = base * aggregate(
    :,
    (x -> quantile(x, [0.25, 0.75])) => [
        first => 2,
        last => 3,
    ]
) * visual(Rangebars, linewidth = 3, color = :red)

draw(scat + med + interval)

With `=> 2` and `=> 3` we assign the first and second quantile to positional mappings 2 and 3 for `Rangebars`. If you don't specify a remapping, the initial mapping is kept, but only one output can be assigned to any given mapping.
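
The two-stage idea can be sketched for a single group in plain Julia, using only `Statistics`: one function produces several values at once, and accessor functions then pick out the individual components. This only illustrates the concept and makes no claim about how `aggregate` processes the pairs internally. The sample data and the names `lower` and `upper` are made up for the example.

```julia
using Statistics

# Made-up values for a single group.
group_vals = [1.0, 2.0, 3.0, 4.0, 5.0]

# Stage 1: a single aggregation function returns two values at once.
q = quantile(group_vals, [0.25, 0.75])

# Stage 2: accessor functions split the result into separate outputs,
# mirroring `first => 2` and `last => 3` in the example above.
lower = first(q)
upper = last(q)
```

Here `lower` and `upper` play the roles of the lower and upper ends of the range bar for that group.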

In this case, it might look nice to apply a dodge, so the raw points and the aggregated components are easier to tell apart.

with_dodge = scat * mapping(dodge_x = direct("A")) +
    (med + interval) * mapping(dodge_x = direct("B"))

draw(with_dodge, scales(DodgeX = (; width = 0.2)))

Grouping by multiple mappings also works, for example to compute a heatmap by summing all values of a given group (the empty cells are combinations of `x` and `y` that happen not to occur in the random data):

df = (;
    x = rand(1:5, 100),
    y = rand(1:5, 100),
    z = randn(100)
)
data(df) * mapping(:x, :y, :z) * aggregate(:, :, sum) *
    visual(Heatmap) |> draw
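
As a rough sketch of what grouping by two columns amounts to, the following plain Julia snippet sums `z` for every `(x, y)` combination that occurs and lays the sums out on a grid. Representing the non-occurring combinations as `NaN` is an assumption made for this illustration, not a statement about how the PR handles them. The small input vectors are made up.

```julia
# Made-up data: the combination (1, 2) never occurs.
x = [1, 1, 2, 2, 2]
y = [1, 1, 1, 2, 2]
z = [0.5, 1.5, 2.0, 3.0, 4.0]

# Accumulate z for every (x, y) combination present in the data.
sums = Dict{Tuple{Int, Int}, Float64}()
for (xi, yi, zi) in zip(x, y, z)
    sums[(xi, yi)] = get(sums, (xi, yi), 0.0) + zi
end

# Lay the sums out on a grid; combinations that never occur stay NaN
# (an assumption of this sketch for representing empty cells).
grid = fill(NaN, maximum(x), maximum(y))
for ((xi, yi), s) in sums
    grid[xi, yi] = s
end
```

In this toy input, `grid[1, 2]` stays `NaN` because no row has `x == 1` and `y == 2`.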

Additionally, the outputs can be renamed with the pair syntax, and you can assign a scale id just as with a normal mapping, because `aggregate` freely creates new mapped columns:

data(df) * mapping(:x, :y, :z) *
    aggregate(
        :,
        :,
        sum => "Sum of all z values" => scale(:sumcolor)
    ) *
    visual(Heatmap) |>
    draw(scales(sumcolor = (; colormap = :plasma)))

jkrumbiegel added 30 commits October 30, 2025 16:37

second case with flipped dims works

c5f785f

named mappings can be aggregated

627ae81

fix case where aggregator returns union missing

3449046

switch to grouping via :

374332c

make aggregation under multiple groups work

6d34841

fix problem with missing group combinations

8e2e42b

remove printout

7e516b6

add rangebar case with splitting of extrema

3e0724f

add test case with preexisting grouping

7980f27

add ability to specify custom labels

f997dbb

allow any type as label (also RichText)

30efde7

handle rich text in aggregation label

0a13e01

allow changing scale ids

092396c

allow labelling and scale id'ing for each component in a split

fc41cae

use dispatch and not if else

2c205c4

make parsing code more explicit

16e3284

move first test

dce92de

move next test

1728064

add next test

cf71ea6

move third test

547bfdc

move next test

33d16e5

move next test

b1fff6c

move next test

7832bb4

move heatmap scale / label test

af03609

move last test

2464c19

add testcase with verbatim and categorical in split

8ca5682

add refimages

506b6b2

formatting

a193891

make aggregation into vectors also work

2c51fd6

add error for higher-dimensional outputs

29ea687

jkrumbiegel added 3 commits November 5, 2025 16:31

change to selective syntax (like highlight)

7077084

add no-group case

3a2a534

add tuple input column test case

135d6da

jkrumbiegel changed the base branch from master to jk/0.12 November 6, 2025 12:11

jkrumbiegel added 9 commits November 6, 2025 13:20

add label/scaleid to docstring

721f38a

run CI also on PRs not against master

76890f7

formatting

eefcb36

add changelog

22ae3c7

add docs

1c288ce

move using statistics

137a4e9

move docstring of aggregate to the actual function

1e56836

fix case with assignment without split

3e23eae

improve error for multiple output slots and add test

5edab6d

jkrumbiegel marked this pull request as ready for review November 6, 2025 13:46

jkrumbiegel merged commit 62095da into jk/0.12 Nov 6, 2025
7 checks passed

jkrumbiegel deleted the jk/aggregate branch November 6, 2025 13:47
