Skip to content

Conversation

elmarco
Copy link

@elmarco elmarco commented Feb 12, 2025

The input format is not necessarily RGB or RGBA, and doing a pre-pass conversion can be quite costly (adding about 15-20% of total time from empirical study)

For simplicity reasons, I made sure not to break the existing API.

Those changes don't seem to affect the encoder performance in a significant way.

@elmarco
Copy link
Author

elmarco commented Feb 13, 2025

See also: Devolutions/IronRDP#670

@elmarco
Copy link
Author

elmarco commented Mar 18, 2025

@aldanor hi, wdyt?

@elmarco
Copy link
Author

elmarco commented Apr 15, 2025

@aldanor ping :thanks:

@elmarco
Copy link
Author

elmarco commented May 9, 2025

@aldanor are you still maintaining the crate? thanks

@Joshix-1
Copy link

The api feels kinda confusing too me. It would be cool to control Input format and output format independently. I have the use case that I have an image with RGBA data where I know that the image has no translucency, so I want to save it as RGB. I don't know how I would do this with this PR. Which is why I created #15

Would be cool if it was possible to control the output format independently like in my pr.

And I don't really understand the stride argument. That should maybe be documented better.

@elmarco
Copy link
Author

elmarco commented May 24, 2025

The api feels kinda confusing too me. It would be cool to control Input format and output format independently. I have the use case that I have an image with RGBA data where I know that the image has no translucency, so I want to save it as RGB. I don't know how I would do this with this PR. Which is why I created #15

You can use the RawChannels::Rgbx or Bgrx, or Xbgr etc.. from this PR for that.

Would be cool if it was possible to control the output format independently like in my pr.

Well, you can't really mix input and output formats freely. You need alpha channel input for alpha channel output. And QOI has only two output formats, rgb and rgba...

And I don't really understand the stride argument. That should maybe be documented better.

https://learn.microsoft.com/en-us/windows/win32/medfound/image-stride

@Joshix-1
Copy link

You can use the RawChannels::Rgbx or Bgrx, or Xbgr etc.. from this PR for that.

True, didn't think about that. But what if someone wants to add an alpha channel?

You need alpha channel input for alpha channel output.

Adding an alpha channel with default values of 255 would be imho fine and not unexpected.

@elmarco
Copy link
Author

elmarco commented May 27, 2025

You need alpha channel input for alpha channel output.

Adding an alpha channel with default values of 255 would be imho fine and not unexpected.

Well, this use case goes beyond the typical simple encoding imo.

@Joshix-1
Copy link

Joshix-1 commented May 28, 2025

Why is RawChannels::bytes_per_pixel not public? There should either be another Encoder constructor without a stride argument or bytes_per_pixel should be public. Otherwise the Encoder::new_raw API is imho too clunky to use if you have no stride.

@elmarco
Copy link
Author

elmarco commented May 28, 2025

Why is RawChannels::bytes_per_pixel not public? There should either be another Encoder constructor without a stride argument or bytes_per_pixel should be public. Otherwise the Encoder::new_raw API is imho too clunky to use if you have no stride.

We could make stride an Option. Alternatively, we could have an EncoderBuilder, but this might be overkill.

@Joshix-1
Copy link

Stride could even be Option<NonZeroUsize>, but Option<usize> would be fine as well

Copy link

@Joshix-1 Joshix-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the EncoderBuilder. It really improves the usability. Have a few nitpics

src/encode.rs Outdated
if stride * (height - 1) as usize + width as usize * raw_channels.bytes_per_pixel() < size {
return Err(Error::InvalidImageLength { size, width, height });
}
if guess_stride && size != width as usize * height as usize * raw_channels.bytes_per_pixel()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

width as usize * raw_channels.bytes_per_pixel() is a common element in the if statements, maybe that could be put in a variable

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

width_usize ? hmm, I don't know if this is a common pattern. I am a bit unsure.

@Joshix-1 Joshix-1 mentioned this pull request Jul 1, 2025
elmarco added a commit to elmarco/qoi-rust that referenced this pull request Jul 22, 2025
 RFE: encoder: add stride & various raw input formats support aldanor#11
@aldanor
Copy link
Owner

aldanor commented Jul 28, 2025

@elmarco Added some comments, please lmk if you want to continue working on this pr or not?

Also, you'd need to rebase your branch after removing commits from this branch as lots of fixes are already on main:

  • bench cli fix (plus --stream mode addon)
  • fuzz package fix
  • large buffer fix
  • run-starting edgecase fix
  • there's also commits in this pr that seemingly belong to the fork like renaming the package

There's also a few questions:

  • can you run the benchmark before/after on the encoder and post the results, to make sure that double nested loop in the encoder isn't affecting things much?
  • should there be any tests with the streaming mode? (it should just work, but still)

Signed-off-by: Marc-André Lureau <[email protected]>
@elmarco
Copy link
Author

elmarco commented Jul 28, 2025

@aldanor thanks for the review

* can you run the benchmark before/after on the encoder and post the results, to make sure that double nested loop in the encoder isn't affecting things much?

This is the biggest problem, it seems I didn't benchmark properly. I get a -33% encoding perf. I am trying to understand where it comes from and how to fix it. thanks

@aldanor
Copy link
Owner

aldanor commented Jul 28, 2025

I am trying to understand where it comes from and how to fix it

It might come from a double loop with now-unknown number of iterations in each preventing some common loop optimizations? I guess you can godbolt something equivalent.

Yea, -33% is pretty bad for sure... -3% would still be not nice but within noise bounds.

@elmarco
Copy link
Author

elmarco commented Jul 28, 2025

Yea, -33% is pretty bad for sure... -3% would still be not nice but within noise bounds.

it turns out it's the assert_eq!(), this adds a bunch of panic handling code and prevent some optimizations. I switched it to debug_assert!(). Now, no performance regression is observable (with perf stat).

@elmarco elmarco force-pushed the raw branch 3 times, most recently from 9fd810a to 5b9aa9b Compare July 28, 2025 17:13
@Joshix-1
Copy link

assert_eq

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        279.8       381.8
         MB/s        876.6      1196.4
--------------------------------------
encode   Mp/s        224.4       206.9
         MB/s        703.3       648.2
--------------------------------------

debug_assert_eq

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        280.2       381.9
         MB/s        878.0      1196.6
--------------------------------------
encode   Mp/s        225.6       208.9
         MB/s        706.9       654.7
--------------------------------------

81c14c4

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        278.7       364.3
         MB/s        873.4      1141.6
--------------------------------------
encode   Mp/s        218.3       254.8
         MB/s        684.2       798.3
--------------------------------------

For me it's still slower with debug_assert_eq

@elmarco
Copy link
Author

elmarco commented Jul 28, 2025

did you compile with --release ?

@Joshix-1
Copy link

I ran cargo run --release ../assets/ in the bench folder

This extra feature allows to turn on/off the extra input formats and
shows that encode_impl() isn't correctly optimized independently of the
various existing formats. This should probably be reported or analyzed
by the compiler team. At least, I am not able to explain the reason.

Signed-off-by: Marc-André Lureau <[email protected]>
@elmarco
Copy link
Author

elmarco commented Jul 28, 2025

(I switched to stable rust, as I get slightly different results with nightly which can create confusion)

@Joshix-1 try with the latest commit. You should not see performance loss (I actually observe a slight improvement something like 0.5%). Something is weird when enabling the "extra-source" feature, the encode_impl() don't get optimized the same way and we can observe -5-10%. It seems to be related to the Fn / closure somehow. But each format or fn specialization should be receiving an independant analysis and optimization no? It may be worth to ask/report to the compiler team. In the mean time, perhaps the "extra-source" feature flag is acceptable?

@Joshix-1
Copy link

ce61683e8682edb68f8f040f4cbbce1eae143fb7 has a similar effect. Seems to be some weird case where the compiler doesn't optimize

@elmarco
Copy link
Author

elmarco commented Jul 28, 2025

@Joshix-1 I am really curious to know how you found about adding this extra E/Infallible !

@Joshix-1
Copy link

Created a flamegraph and saw Try take a big chunk.
grafik

@elmarco
Copy link
Author

elmarco commented Jul 29, 2025

I tried to make a subset test case to report to the compiler without success atm.

@Joshix-1 what to do next?:

  • use your workaround
  • add the feature flag I proposed
  • look for other solution (is stream writer really useful?)
  • wait for a compiler fix

thanks

@Joshix-1
Copy link

I think my commit could also improve the performance of master. Which would mean that this pr has to get even faster.

I have some ideas I want to test out. I'll do some more testing an benchmarking. Stream writer is imho useful, I don't see a reason to remove it

@Joshix-1
Copy link

Could not improve performance of master. I would suggest using 50293f3

With

[profile.release]
opt-level = 3
debug = true

in bench/Cargo.toml I get

$ cargo run --release ../assets/
Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        274.2       383.2
         MB/s        859.3      1200.9
--------------------------------------
encode   Mp/s        220.7       260.4
         MB/s        691.7       816.0
--------------------------------------

without the debug = true I get

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        280.9       370.7
         MB/s        880.2      1161.4
--------------------------------------
encode   Mp/s        226.5       252.6
         MB/s        709.8       791.6
--------------------------------------

on master I get:

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        284.8       368.8
         MB/s        892.4      1155.7
--------------------------------------
encode   Mp/s        220.2       257.1
         MB/s        690.0       805.7
--------------------------------------

But I don't really know anymore, it's all really weird and unexpected

@elmarco
Copy link
Author

elmarco commented Jul 29, 2025

indeed, I am ok with 50293f3, we can later revert it.

@Joshix-1
Copy link

Why revert it later? The change makes sense. Infallible methods shouldn't return results that could contain errors

@elmarco
Copy link
Author

elmarco commented Jul 29, 2025

The compiler should be able to infer that BytesMut Writer is in fact Infallable when inlining, I think that's what it's doing when "extra-source" is off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants