Conversation

madsmtm
Member

@madsmtm madsmtm commented Aug 12, 2025

Transactions are expensive, and the layer should be able to figure out the timing of when to render by itself (by virtue of being installed in a view). The only reason why we did it before was to avoid a fade transition between layer content changes.

Part of #83. I have not benchmarked this, but I have visually confirmed less stuttering when resizing.

@madsmtm madsmtm added enhancement New feature or request CoreGraphics macOS/iOS/tvOS/watchOS/visionOS backend labels Aug 12, 2025
@madsmtm madsmtm changed the title Avoid the explicit CATransaction commit Avoid the explicit CATransaction Aug 12, 2025
@madsmtm madsmtm force-pushed the cg-avoid-transaction branch from cecb0bc to e8ddecc Compare August 12, 2025 22:40
Comment on lines 12 to 16
 use objc2_foundation::{
     ns_string, NSDictionary, NSKeyValueChangeKey, NSKeyValueChangeNewKey,
-    NSKeyValueObservingOptions, NSNumber, NSObject, NSObjectNSKeyValueObserverRegistration,
+    NSKeyValueObservingOptions, NSNull, NSNumber, NSObject, NSObjectNSKeyValueObserverRegistration,
     NSString, NSValue,
 };
Member

Never liked these big mass imports. Any chance we could instead do:

use objc2_foundation as found;

...then do:

found::NSNull

Member Author

Hmm, I'd rather avoid that, when reading the implementation, it doesn't actually matter much which framework a specific thing is from - and besides, everything is already prefixed ("NS", "CG" etc.).

I'd rather do:

use objc2_core_foundation::*;
use objc2_core_graphics::*;
use objc2_foundation::*;
use objc2_quartz_core::*;

?

Member Author

(I guess objc2_* could have taken the same approach as cidre, e.g. cidre::ns::Null, which is admittedly much more "Rusty", though I decided not to, since it makes it harder to figure out what a specific type corresponds to underneath).

Member

I think it makes the code later on harder to read for humans. My vote's still on having a module prefix. I take the same approach in new Win32 code that I write.

Member

For the record, I do something similar in the Windows part of the code.

If you think it's out of scope for this PR, I can file one later.

@nicoburns

nicoburns commented Aug 13, 2025

Wow. This is dramatically faster (up to 1000x !!!) for me. I'm seeing present times measured in 10s of microseconds rather than 10s of milliseconds. Specifically: running Blitz (a winit application) on my 14" MacBook Pro (M1), I'm getting the following times for calling surface_buffer.present().unwrap():

Test            | softbuffer 0.4 | This PR | buffer_mut | pixels 0.15
800x600 1x      | 1.5ms          | 17us    | 200us      | 500us
800x600 2x      | 6ms            | 25us    | 650us      | 1ms
1512x982 2x     | 18ms           | 30us    | 1.5ms      | 3ms

To reproduce:

Then:

  • To test softbuffer 0.4: cargo run -rp readme --no-default-features --features comrak,log_frame_times,log_phase_times,cpu-softbuffer .
  • To test this PR, add the following to the bottom of the root-level (workspace) Cargo.toml:
    [patch.crates-io]
    softbuffer = { git = "https://github.com/rust-windowing/softbuffer", branch = "cg-avoid-transaction" }
    and then run the same command as for softbuffer 0.4.
  • To test the pixels crate: cargo run -rp readme --no-default-features --features comrak,log_frame_times,log_phase_times,cpu-pixels .

You should then see output like:

Resolve: 11ms (style: 76us, construct: 10ms, flush: 32us, layout: 224us)
Frame time: 12ms (cmd: 11ms, flush: 105us, render: 277us, swizel: 336us, present: 19us)

The present time is the duration of the call to softbuffer's (or pixels') present.
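For readers who want to take a similar measurement without Blitz's logging features, a minimal sketch of timing a single call could look like the following. The `timed` helper is hypothetical (it is not part of softbuffer, pixels, or Blitz), and the closure in `main` is only a stand-in for `surface_buffer.present().unwrap()`.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
/// `timed` is a hypothetical helper, not part of softbuffer or Blitz.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

fn main() {
    // Stand-in for `surface_buffer.present().unwrap()`: any closure can be timed.
    let (sum, elapsed) = timed(|| (0..1_000_000u64).sum::<u64>());
    println!("sum = {sum}, present: {elapsed:?}");
}
```

A measurement like this only captures time spent inside the call itself. Work that is deferred to another place (for example, to the system's rendering pipeline) will not show up in it.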

@madsmtm madsmtm force-pushed the cg-avoid-transaction branch from e8ddecc to 91c1904 Compare August 13, 2025 15:04
@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Well, to be fair, this PR is not actually doing any work in the present call any more.

The actual work now happens in buffer_mut (allocation of a new buffer, which is deallocated once the CGImage is no longer referenced) and internally somewhere in Apple's rendering pipeline (maybe -[CALayer display]). I think if you're going to benchmark this, you'll need to use Instruments.app, flamegraph, or some other whole-program benchmarking.

These are expensive, and the layer should be able to figure out the
timing of when to render by itself.
@madsmtm madsmtm force-pushed the cg-avoid-transaction branch from 91c1904 to e23a8fb Compare August 13, 2025 15:26
@nicoburns

Well, to be fair, this PR is not actually doing any work in the present call any more.
The actual work now happens in buffer_mut ... and internally somewhere in Apple's rendering pipeline (maybe -[CALayer display]).

Hmm... I wasn't previously timing buffer_mut, but I just added it and the amount of time spent there doesn't seem to have changed much (~200us to ~1ms depending on buffer size for both 0.4 and this PR). I guess there could be some time being spent elsewhere within Apple frameworks that I'm not capturing. But this PR is visually much smoother for me, so I suspect it is genuinely faster overall.

@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Hmm... I wasn't previously timing buffer_mut, but I just added it and the amount of time spent there doesn't seem to have changed much (~200us to ~1ms depending on buffer size for both 0.4 and this PR).

Sorry, that wasn't particularly clear; I meant that buffer_mut was, and still is, doing a large part of the work (and some of this work would be lessened by using IOSurface and/or swapping between buffers instead of reallocating).
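As an illustration of the "swapping between buffers instead of reallocating" idea, here is a minimal sketch in plain Rust. `DoubleBuffer` and its methods are hypothetical names chosen to mirror the softbuffer calls discussed here; this is not how softbuffer is implemented, and it assumes the previously presented buffer is no longer being read by the compositor when it is reused.

```rust
/// Hypothetical two-buffer pool: `front` is what was last presented, `back` is drawn into.
struct DoubleBuffer {
    front: Vec<u32>,
    back: Vec<u32>,
}

impl DoubleBuffer {
    fn new(width: usize, height: usize) -> Self {
        let len = width * height;
        Self { front: vec![0; len], back: vec![0; len] }
    }

    /// Mutable access to the buffer being drawn; no allocation happens here.
    fn buffer_mut(&mut self) -> &mut [u32] {
        &mut self.back
    }

    /// Swaps the two buffers instead of allocating a fresh one per frame.
    fn present(&mut self) {
        std::mem::swap(&mut self.front, &mut self.back);
    }
}

fn main() {
    let mut buffers = DoubleBuffer::new(4, 2);
    let back_ptr = buffers.buffer_mut().as_ptr();
    buffers.buffer_mut().fill(0x00ff_00ff);
    buffers.present();
    // The storage that was drawn into is now the front buffer, reused rather than reallocated.
    assert_eq!(buffers.front.as_ptr(), back_ptr);
    println!("front[0] = {:#010x}", buffers.front[0]);
}
```

A real backend would additionally need some way to know when the system has finished with the front buffer, which this sketch leaves out.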

But this PR is visually much smoother for me, so I suspect it is genuinely faster overall.

Definitely agree.

@MarijnS95
Member

MarijnS95 commented Aug 13, 2025

Reading the definition of CATransaction, doesn't this simply offload/postpone the cost to somewhere else (e.g. the "when the thread’s runloop next iterates.")?

Or were we accidentally waiting for the transaction to have completed, while this new model allows multiple implicit transactions to be created and submitted "asynchronously"?

Just curious to map this to all other platforms' "compositor transaction" abstractions :)

@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Reading the definition of CATransaction, doesn't this simply offload/postpone the cost to somewhere else (e.g. the "when the thread’s runloop next iterates.")?

I think that's true, yeah. Disassembling setContents:, I found that it locks the CATransaction and inserts the change into that (such that it will be applied later).

I suspect that the real problem is actually in the way that Winit schedules redraws such that they happen outside display/drawRect: (and thereby outside the transaction) in the first place (good old rust-windowing/winit#2640 strikes again).
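The deferred behaviour described above for setContents: (the change is recorded in a transaction and applied later) can be pictured with a toy model. Everything in this sketch is hypothetical: `ToyLayer`, `set_contents` and `run_loop_iteration` are illustrative names, not Core Animation API, and the real implementation is certainly more involved.

```rust
/// Toy model of an implicit transaction: property changes are queued, then applied in one batch.
/// None of these names are Core Animation API; they only illustrate the ordering.
#[derive(Default)]
struct ToyLayer {
    contents: Option<u32>, // what is currently "on screen"
    pending: Vec<u32>,     // changes recorded since the last flush
}

impl ToyLayer {
    /// Analogue of `setContents:`: records the change instead of applying it.
    fn set_contents(&mut self, image_id: u32) {
        self.pending.push(image_id);
    }

    /// Analogue of the run loop iterating: the last queued change wins.
    fn run_loop_iteration(&mut self) {
        if let Some(last) = self.pending.pop() {
            self.contents = Some(last);
            self.pending.clear();
        }
    }
}

fn main() {
    let mut layer = ToyLayer::default();
    layer.set_contents(1);
    layer.set_contents(2);
    // Nothing is visible until the queued changes are flushed.
    assert_eq!(layer.contents, None);
    layer.run_loop_iteration();
    assert_eq!(layer.contents, Some(2));
    println!("contents after flush: {:?}", layer.contents);
}
```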

@MarijnS95
Member

MarijnS95 commented Aug 13, 2025

Thanks for looking into that! Some quick local testing shows that ::commit() on a MacBook Air M4 takes about 3.7ms on average on the animation example.

Could it be that this call is blocking, when applications use it directly? Assigning a completion handler shows that it completes at around the same time. Curious how you "disassembled" so that we can look into ::commit() instead.


Never ::begin()ing a new transaction shows that the completion handler is now called between 5 and 15ms (with some 21ms outliers) after setContents(), so I'm really curious whether we're just trading some predictable/visible "CPU overhead" (or blocking? - need to attach a profiler) for increased latency?
EDIT: Adding a frame index shows that these transactions are completing out of order a bunch of times...

Note that animationDuration is equal to 0.25 by default (and the animationTimingFunction is None) - setting that to 0f64 shows a consistent delay of 250µs in the completion handler...
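To quantify the out-of-order completions mentioned in the EDIT above, one option is to log the frame index from each completion handler and count how many arrive after a later frame has already completed. The following is a minimal sketch of that counting step; `count_out_of_order` is a hypothetical helper and the indices in `main` are made-up sample data, not measurements from this PR.

```rust
/// Counts completions whose frame index is lower than one already seen.
fn count_out_of_order(completed: &[u32]) -> usize {
    let mut highest_seen: Option<u32> = None;
    let mut out_of_order = 0;
    for &frame in completed {
        match highest_seen {
            // A frame older than the newest completed one arrived late.
            Some(max) if frame < max => out_of_order += 1,
            // Otherwise this is the newest frame seen so far.
            _ => highest_seen = Some(frame),
        }
    }
    out_of_order
}

fn main() {
    // Made-up completion order: frames 2 and 5 complete after newer frames.
    let log = [0, 1, 3, 2, 4, 6, 5];
    println!("out-of-order completions: {}", count_out_of_order(&log));
}
```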

@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Curious how you "disassembled" so that we can look into ::commit() instead.

I did lldb target/debug/examples/rectangle and set a few breakpoints.

Never ::begin()ing a new transaction

Uhh, pretty sure that's invalid use of the API, otherwise you may be committing work done by something higher in your call stack.

Could it be that this call is blocking, when applications use it directly? Assigning a completion handler shows that it completes at around the same time.

I'm really curious if we're just trading some predictable/visible "CPU overhead" (or blocking? - need to attach a profiler) for increased latency?

I don't completely know how CATransaction::commit works when outside of a draw call issued by the OS (such as -[NSView drawRect:]), but I think it submits the result to the compositor immediately. Testing on current master by calling .present() a hundred times per requested redraw in the rectangle example seems to back this theory up.

And yeah, with this PR, you will get a bit of latency here, in that the result is not actually presented immediately, but instead only presented the next time the OS renders.

(I'm pretty sure all of these issues would just go away if the Winit issue were fixed, since then the CATransaction would know that it was run inside a draw call by the OS, and the commit wouldn't render immediately).

@MarijnS95
Member

Apologies, I meant to also skip ::commit(), i.e. the layer update and the callback used to time it would run later, in the iteration of that thread's runloop that I linked before, which is what you do in this PR.

Curious to see those complete out of order, some frames taking very long.

I should've assumed the debugger might have "enough" debug symbols to see what is going on under the hood 😅


And yeah, I've been wanting to write better present-timing abstractions in Winit for years. RedrawRequested is also fundamentally broken on Android. Curious if you get less delay on Mac if it's running at a closer time before vblank (or does it always run right after the previous vblank?).
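The vblank question can be framed with a deliberately simplified model: if a frame only becomes visible at the first vblank after it was submitted, the wait depends on where in the refresh interval the submission happens. The sketch below assumes exactly that, with vblanks at exact multiples of a fixed interval; `wait_until_next_vblank` is a hypothetical function, and whether macOS actually behaves this way is the open question here.

```rust
use std::time::Duration;

/// Time from submitting a frame until the next vblank, assuming vblanks occur at
/// exact multiples of `interval` and the frame is shown at the first one after submission.
/// `interval` must be non-zero.
fn wait_until_next_vblank(submitted_at: Duration, interval: Duration) -> Duration {
    let into_interval = submitted_at.as_nanos() % interval.as_nanos();
    let remaining = interval.as_nanos() - into_interval;
    Duration::from_nanos(remaining as u64)
}

fn main() {
    // Illustrative 16ms interval (roughly 60Hz), not a measured value.
    let interval = Duration::from_millis(16);
    for submitted_ms in [1, 8, 15] {
        let wait = wait_until_next_vblank(Duration::from_millis(submitted_ms), interval);
        println!("submitted {submitted_ms}ms after vblank: waits {wait:?}");
    }
}
```

Under this model, drawing right after a vblank gives the longest wait and drawing just before the next one gives the shortest, at the cost of a higher risk of missing it.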

@madsmtm
Member Author

madsmtm commented Aug 13, 2025

Curious if you get less delay on Mac if it's running at a closer time before vblank (or does it always run right after the previous vblank?).

No idea honestly, and unsure of how I'd test it?

@MarijnS95
Member

No idea honestly, and unsure of how I'd test it?

This is what I did, perhaps we could add it to that draw(Rect): callback and see how much of a delay it has, respectively?

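use std::ops::Deref;
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Instant;

// Frame counter used to label each completion, so out-of-order completions are visible.
static FRAME: AtomicU32 = AtomicU32::new(0);

// Start timing right before the layer contents are replaced.
let s = Instant::now();
unsafe { self.imp.layer.setContents(Some(image.as_ref())) };
let frame = FRAME.fetch_add(1, Ordering::Relaxed);

unsafe {
    CATransaction::setCompletionBlock(Some(
        // Does this clone or otherwise reference the block? After the move,
        // the closure lifetime is 'static and could use StackBlock as well?
        block2::RcBlock::new(move || {
            // Prints the frame index and the time elapsed since setContents.
            println!("{frame:0>6}: {:?}", s.elapsed());
        })
        .deref(),
    ))
};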

@nicoburns

nicoburns commented Aug 20, 2025

This PR seems to have a memory problem. I am able to get memory usage to spike as high as 3GB+ with this PR just by scrolling my app (which causes it to render frames). Interestingly, it does drop if I resize the window, but only down to ~700MB. Rendering this same app with pixels, it sits at around 150MB.
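For a sense of scale only, the size of one full-window buffer can be worked out from the window dimensions. The sketch below assumes 4 bytes per pixel (one `u32`) and borrows the 1512x982 2x size from the earlier timing table; the window size at the time of the 3GB observation is not stated, so this is an assumption, and the calculation says nothing about where the memory actually goes.

```rust
/// Bytes needed for one frame buffer, assuming 4 bytes (one `u32`) per pixel.
fn buffer_bytes(logical_width: u64, logical_height: u64, scale: u64) -> u64 {
    (logical_width * scale) * (logical_height * scale) * 4
}

fn main() {
    // Assumed window size, taken from the largest row of the timing table.
    let per_buffer = buffer_bytes(1512, 982, 2);
    let three_gib: u64 = 3 * 1024 * 1024 * 1024;
    println!("one buffer: {per_buffer} bytes");
    println!("buffers fitting in 3 GiB: {}", three_gib / per_buffer);
}
```

If buffers of that size were being kept alive rather than freed, it would take on the order of a hundred of them to reach the reported figure.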

Labels
CoreGraphics macOS/iOS/tvOS/watchOS/visionOS backend enhancement New feature or request
4 participants