Conversation

@SwayamInSync
Member

Similar implementation to #225 (except that endianness does not apply here)

@seberg
Member

seberg commented Nov 20, 2025

I do wonder if it makes sense to template this. Even the aligned/unaligned could possibly be templated with a tiny memcpy no-op or assignment helper (assuming the compiler will optimize things away)?

(It's just almost the same code 4 times, I guess? And even if you take care of some unicode shenanigans -- such as a proper unicode whitespace check -- honestly, I don't think it matters speed-wise to just use the unicode check for bytes as well.)

@SwayamInSync
Member Author

The input is different: loading from Py_UCS vs. a normal char * (so it might need a specialized template, which would again expand to the same code size).
I think the processing logic could be modularized, though. I'll give it a shot here.
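
A minimal sketch of what modularizing the processing logic over the character type could look like. The helper names `is_space_char` and `skip_leading_whitespace`, the ASCII-only whitespace rule, and the use of `char32_t` as a stand-in for the Py_UCS type are all assumptions for illustration, not code from this PR.

```cpp
#include <cstddef>

// Hypothetical helper: ASCII-only whitespace check that works for any
// integer-like character type (char for bytes, char32_t standing in for
// a UCS code unit). A real implementation might use a full unicode check.
template <typename CharT>
constexpr bool is_space_char(CharT c)
{
    return c == CharT(' ') || (c >= CharT('\t') && c <= CharT('\r'));
}

// Hypothetical helper: returns the index of the first non-whitespace
// character in buf[0..len), or len if every character is whitespace.
template <typename CharT>
constexpr std::size_t skip_leading_whitespace(const CharT *buf, std::size_t len)
{
    std::size_t i = 0;
    while (i < len && is_space_char(buf[i])) {
        ++i;
    }
    return i;
}
```

The same function template is instantiated once per character type, so the parsing logic is written once even though the compiler still emits one copy per type.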

Member Author

@SwayamInSync SwayamInSync left a comment

Sorry this took me 3 days; I was learning more about the compiler optimizations. Here is the Godbolt Compiler Explorer link as proof.

if constexpr (Aligned) {
    return *(const T *)ptr;
}
else {
Member Author

Case-1: Aligned is true
- if constexpr is resolved at compile time, so there is no runtime branch; only the direct load is compiled in

Case-2: Aligned is false
- The size is known at compile time (16 bytes)
- so the memcpy is just a copy into a local variable with known alignment
- and the compiler replaces the memcpy call with inline unaligned load instructions (movdqu on x86-64)
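
The two cases can be sketched as a pair of load/store templates. This is a minimal sketch under stated assumptions: the names `load_value` and `store_value` are hypothetical and may not match what utilities.h actually defines, and whether the memcpy is folded into inline loads depends on the compiler and optimization level.

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical helper (name is an assumption, not the PR's identifier).
// Aligned == true: plain dereference; the caller guarantees alignment.
// Aligned == false: fixed-size memcpy into a local, which optimizing
// compilers typically lower to unaligned load instructions.
template <bool Aligned, typename T>
inline T load_value(const char *ptr)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy-based load requires a trivially copyable type");
    if constexpr (Aligned) {
        return *reinterpret_cast<const T *>(ptr);
    }
    else {
        T value;
        std::memcpy(&value, ptr, sizeof(T));
        return value;
    }
}

// Hypothetical counterpart for stores, with the same two paths.
template <bool Aligned, typename T>
inline void store_value(char *ptr, const T &value)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy-based store requires a trivially copyable type");
    if constexpr (Aligned) {
        *reinterpret_cast<T *>(ptr) = value;
    }
    else {
        std::memcpy(ptr, &value, sizeof(T));
    }
}
```

Because the discarded branch of `if constexpr` is never instantiated, each instantiation contains only one of the two paths.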

@SwayamInSync
Member Author

SwayamInSync commented Nov 30, 2025

@seberg if this looks fine, I'll be happy to refactor the other loops in future PRs as well

@ngoldbaum
Member

If there are any spots you'd particularly like reviewed, pointing them out would help. This is a big diff!

@SwayamInSync
Member Author

It actually became big because I also refactored the unicode casting code here (which was done in #225).

So if you look at casts.cpp, there are no longer separate aligned and unaligned loops (for both bytes and unicode). We now instantiate them from a template by setting the Aligned template parameter to true for the aligned loop and false for the unaligned loop, and each instantiation uses the corresponding load/store method defined in utilities.h.

So I think reviewing just the following will be good enough:

  • the new load/store templates in utilities.h, where the compiler can optimize out the memcpy
  • in casts.cpp, just bytes_to_quad_strided_loop and quad_to_bytes_loop (the unicode part was already reviewed; here it was only refactored to use the template as well)
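
To illustrate the pattern of one loop body instantiated for both alignment cases, here is a minimal sketch. Everything in it is hypothetical: `load_elem`, `store_elem`, `negate_strided_loop`, and the `loop_fn` signature are illustrative stand-ins operating on `double`, not the actual loops or signatures in casts.cpp.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical element accessors: direct access when Aligned is true,
// fixed-size memcpy otherwise.
template <bool Aligned, typename T>
inline T load_elem(const char *p)
{
    if constexpr (Aligned) {
        return *reinterpret_cast<const T *>(p);
    }
    else {
        T v;
        std::memcpy(&v, p, sizeof(T));
        return v;
    }
}

template <bool Aligned, typename T>
inline void store_elem(char *p, T v)
{
    if constexpr (Aligned) {
        *reinterpret_cast<T *>(p) = v;
    }
    else {
        std::memcpy(p, &v, sizeof(T));
    }
}

// A single strided loop body (here it just negates each element),
// written once and parameterized on alignment.
template <bool Aligned>
void negate_strided_loop(const char *in, char *out, std::size_t n,
                         std::ptrdiff_t in_stride, std::ptrdiff_t out_stride)
{
    for (std::size_t i = 0; i < n; ++i) {
        double v = load_elem<Aligned, double>(in);
        store_elem<Aligned, double>(out, -v);
        in += in_stride;
        out += out_stride;
    }
}

// The two instantiations take the place of two hand-written loops.
using loop_fn = void (*)(const char *, char *, std::size_t,
                         std::ptrdiff_t, std::ptrdiff_t);
constexpr loop_fn aligned_loop = &negate_strided_loop<true>;
constexpr loop_fn unaligned_loop = &negate_strided_loop<false>;
```

A caller would pick `aligned_loop` or `unaligned_loop` depending on what it knows about the buffers, while any fix to the loop body applies to both.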
