
Conversation

@MrBurmark (Member) commented Dec 19, 2025

Summary

Add more support for correctness checking.
Each kernel can now set its own tolerance, and we print whether the checksum met that tolerance in the show-progress screen output and in the checksum output file.
This mechanism is now also used by the test executable.
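
A minimal sketch of how such a per-kernel tolerance check could work (the names KernelBase, checksum_tolerance, and checksumPassed here are illustrative assumptions, not necessarily the actual API):

#include <cmath>

struct KernelBase {
  double checksum_tolerance = 1.0e-7;  // each kernel may override this

  // Report PASSED/FAILED by comparing a variant's checksum against a
  // reference checksum, relative to the kernel's tolerance.
  bool checksumPassed(double checksum, double reference) const {
    double diff = std::fabs(checksum - reference);
    // Relative comparison so the tolerance is meaningful regardless of
    // the checksum's magnitude.
    return diff <= checksum_tolerance * std::fabs(reference);
  }
};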

How concerned should we be that checksums grow over multiple passes?

Example Output

Here is an example of the -sp (show-progress) screen output.

Run kernel -- Basic_MULTI_REDUCE
        Running Base_Seq variant
                Running      default tuning -- 2.71738e-05 sec. x 50 rep. PASSED checksum
        Running Lambda_Seq variant
                Running      default tuning -- 2.71464e-05 sec. x 50 rep. PASSED checksum
        Running RAJA_Seq variant
                Running      default tuning -- 2.71234e-05 sec. x 50 rep. PASSED checksum
        Running Base_OpenMP variant
                Running      default tuning -- 0.000128336 sec. x 50 rep. PASSED checksum
        Running Lambda_OpenMP variant
                Running      default tuning -- 0.000129231 sec. x 50 rep. PASSED checksum
        Running RAJA_OpenMP variant
                Running      default tuning -- 1.70671e-05 sec. x 50 rep. FAILED checksum

Here is an example of checksum.txt file output.

Kernel                     
........................................................
Variants                   Result  Tolerance                   Average Checksum            Max Checksum Diff           Checksum Diff StdDev        
                                                                                           (vs. first variant listed)                              
----------------------------------------------------------------------------------------
Basic_MULTI_REDUCE         
........................................................
Base_Seq-default           PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Lambda_Seq-default         PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
RAJA_Seq-default           PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Base_OpenMP-default        PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Lambda_OpenMP-default      PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
RAJA_OpenMP-default        FAILED  9.9999999999999995475e-08   2918.7998532503377369       2946.4701746910394893       2.2204460492503130808e-16 

-------------------------------------------------------

@MrBurmark MrBurmark requested review from a team and rhornung67 December 19, 2025 18:11
@rhornung67 (Member) left a comment

Either here or in another PR, it may be a good idea to note in the Dev Guide some things to think about when setting the tolerance for a kernel in the code. Maybe a brief explanation of how you determined the tolerance to set for a couple of representative kernels with different tolerances.

@rhornung67 (Member) commented

We've tried to deal with growing checksums by adding a multiplier to keep their magnitude reasonable. We also use a Kahan sum approach for summing checksum values. We haven't looked at this in a while, at least I haven't. Maybe re-investigate since problem sizes are getting larger.
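
For reference, a standard Kahan (compensated) summation looks like the sketch below; this is the generic algorithm, not necessarily the suite's exact implementation:

// Kahan summation: carries a compensation term so low-order bits lost
// when adding small values to a large running sum are recovered.
double kahanSum(const double* vals, int n) {
  double sum = 0.0;
  double c = 0.0;  // running compensation
  for (int i = 0; i < n; ++i) {
    double y = vals[i] - c;  // apply the compensation
    double t = sum + y;      // low-order bits of y may be lost here
    c = (t - sum) - y;       // recover the lost bits
    sum = t;
  }
  return sum;
}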

@artv3 (Member) commented Dec 23, 2025

Do you mean multiple passes over the suite, or the number of times the kernel is invoked within a pass?

@MrBurmark (Member, Author) commented Dec 23, 2025

The checksums are added up over passes as well, and the checksum scaling factor does not account for that effect.

@MrBurmark (Member, Author) commented Dec 26, 2025

I updated the documentation for checksum consistency and added some documentation for the checksum tolerance and scaling factor. @rhornung67

@MrBurmark MrBurmark requested a review from rhornung67 December 26, 2025 21:00
@rhornung67 (Member) commented

@MrBurmark what do you think about dividing the checksum increment by the number of passes (or multiplying by the floating point value 1.0/npasses) each time the checksum is updated for each kernel? This could be hidden in a KernelBase class method; it doesn't add much additional arithmetic, and it would guarantee that the checksums for a run would always be the same magnitude regardless of the number of passes.

@MrBurmark (Member, Author) commented

@rhornung67 Yeah, we could do something like that. How about dividing by the number of times the kernel was run in the getChecksum method?
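
A rough sketch of that idea, assuming hypothetical checksum and num_exec members indexed by variant (not necessarily the actual RAJAPerf data layout):

// Normalize the accumulated checksum by the number of times the kernel
// variant was run, so the reported value is independent of # passes.
double getChecksum(size_t vid) const {
  return checksum[vid] / static_cast<double>(num_exec[vid]);
}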

@MrBurmark (Member, Author) commented

@rhornung67 I added the division.

@MrBurmark (Member, Author) commented Jan 5, 2026

Thinking about the checksum a bit more and how to make it independent of the number of passes: how about we keep a sum of all the checksums, plus the min and max? Then we could compute the largest difference from the mean reference checksum at the end. That would avoid downplaying errors that occurred in only one pass.
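
A small sketch of what that bookkeeping might look like (names are illustrative, not the actual implementation):

#include <algorithm>
#include <cmath>
#include <limits>

struct ChecksumStats {
  double sum = 0.0;
  double min = std::numeric_limits<double>::max();
  double max = std::numeric_limits<double>::lowest();
  int count = 0;

  void update(double checksum) {  // call once per pass
    sum += checksum;
    min = std::min(min, checksum);
    max = std::max(max, checksum);
    ++count;
  }

  // Largest deviation of any single pass from a reference value, e.g.
  // the mean checksum of the reference variant tuning; this catches an
  // error that occurred in only one pass instead of averaging it away.
  double maxDiffFrom(double reference) const {
    return std::max(std::fabs(max - reference), std::fabs(min - reference));
  }
};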

@MrBurmark (Member, Author) commented

Regarding checksums and npasses: when deciding on a reference checksum, I chose to use the min checksum of the reference variant tuning for consistent kernels, since the checksums should all be the same anyway, and the average checksum of the reference variant tuning for other kernels. Would it be simpler if I just used the first checksum of the reference variant tuning?

@MrBurmark (Member, Author) commented

I went ahead and switched to use the first pass of the kernel as the reference checksum.

@MrBurmark MrBurmark enabled auto-merge January 7, 2026 17:59
@MrBurmark MrBurmark merged commit ae8a4b7 into develop Jan 7, 2026
17 checks passed
@rhornung67 rhornung67 deleted the feature/burmark1/correctness branch January 7, 2026 18:26