
Conversation

@MrBurmark (Member) commented Dec 19, 2025

Summary

Add more support for correctness checking.
Each kernel can now set its own tolerance, and we print whether the checksum met that tolerance in the show-progress screen output and in the checksum output file.
This mechanism is now also used by the test executable.
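
A minimal sketch of how such a per-kernel tolerance check could work (the names KernelBase, checksum_tolerance, and checksumPassed here are illustrative assumptions, not necessarily the actual API):

#include <cmath>

struct KernelBase {
  double checksum_tolerance = 1.0e-7;  // each kernel may override this

  // Report PASSED/FAILED by comparing a variant's checksum against a
  // reference checksum, relative to the kernel's tolerance.
  bool checksumPassed(double checksum, double reference) const {
    double diff = std::fabs(checksum - reference);
    // Relative comparison so the tolerance is meaningful regardless of
    // the checksum's magnitude.
    return diff <= checksum_tolerance * std::fabs(reference);
  }
};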

How concerned should we be that checksums grow over multiple passes?

Example Output

Here is an example of the -sp (show-progress) screen output.

Run kernel -- Basic_MULTI_REDUCE
        Running Base_Seq variant
                Running      default tuning -- 2.71738e-05 sec. x 50 rep. PASSED checksum
        Running Lambda_Seq variant
                Running      default tuning -- 2.71464e-05 sec. x 50 rep. PASSED checksum
        Running RAJA_Seq variant
                Running      default tuning -- 2.71234e-05 sec. x 50 rep. PASSED checksum
        Running Base_OpenMP variant
                Running      default tuning -- 0.000128336 sec. x 50 rep. PASSED checksum
        Running Lambda_OpenMP variant
                Running      default tuning -- 0.000129231 sec. x 50 rep. PASSED checksum
        Running RAJA_OpenMP variant
                Running      default tuning -- 1.70671e-05 sec. x 50 rep. FAILED checksum

Here is an example of checksum.txt file output.

Kernel                     
........................................................
Variants                   Result  Tolerance                   Average Checksum            Max Checksum Diff           Checksum Diff StdDev        
                                                                                           (vs. first variant listed)                              
----------------------------------------------------------------------------------------
Basic_MULTI_REDUCE         
........................................................
Base_Seq-default           PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Lambda_Seq-default         PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
RAJA_Seq-default           PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Base_OpenMP-default        PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
Lambda_OpenMP-default      PASSED  9.9999999999999995475e-08   5865.2700279413772306       0.0000000000000000000       0.0000000000000000000       
RAJA_OpenMP-default        FAILED  9.9999999999999995475e-08   2918.7998532503377369       2946.4701746910394893       2.2204460492503130808e-16 

-------------------------------------------------------

@MrBurmark MrBurmark requested review from a team and rhornung67 December 19, 2025 18:11
@rhornung67 (Member) left a comment

Either here or in another PR, it may be a good idea to note in the Dev Guide some things to think about when setting the tolerance for a kernel in the code. Maybe a brief explanation of how you determined the tolerance to set for a couple of representative kernels with different tolerances.

@rhornung67 (Member) commented

We've tried to deal with growing checksums by adding a multiplier to keep their magnitude reasonable. We also use a Kahan sum approach for summing checksum values. We haven't looked at this in a while, at least I haven't. Maybe re-investigate since problem sizes are getting larger.
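
For reference, a standard Kahan (compensated) summation looks like the sketch below; this is the generic algorithm, not necessarily the suite's exact implementation:

// Kahan summation: carries a compensation term so low-order bits lost
// when adding small values to a large running sum are recovered.
double kahanSum(const double* vals, int n) {
  double sum = 0.0;
  double c = 0.0;  // running compensation
  for (int i = 0; i < n; ++i) {
    double y = vals[i] - c;  // apply the compensation
    double t = sum + y;      // low-order bits of y may be lost here
    c = (t - sum) - y;       // recover the lost bits
    sum = t;
  }
  return sum;
}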

@artv3 (Member) commented Dec 23, 2025

Do you mean multiple passes over the suite, or the number of times the kernel is invoked within a pass?

@MrBurmark (Member, Author) commented Dec 23, 2025

The checksums are added up over passes as well, and the checksum scaling factor does not account for that effect.

@MrBurmark (Member, Author) commented Dec 26, 2025

I updated the documentation for checksum consistency and added some documentation for the checksum tolerance and scaling factor. @rhornung67

@MrBurmark MrBurmark requested a review from rhornung67 December 26, 2025 21:00
@rhornung67 (Member) commented

@MrBurmark what do you think about dividing the checksum increment by the number of passes (or multiplying by the floating point value 1.0/npasses) each time the checksum is updated for each kernel? This could be hidden in a KernelBase class method; it doesn't add much additional arithmetic, and it would guarantee that the checksums for a run would always be the same magnitude regardless of the number of passes.

@MrBurmark (Member, Author) commented

@rhornung67 Yeah, we could do something like that. How about dividing by the number of times the kernel was run in the getChecksum method?
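
A rough sketch of that idea, assuming hypothetical checksum and num_exec members indexed by variant (not necessarily the actual RAJAPerf data layout):

// Normalize the accumulated checksum by the number of times the kernel
// variant was run, so the reported value is independent of # passes.
double getChecksum(size_t vid) const {
  return checksum[vid] / static_cast<double>(num_exec[vid]);
}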

@MrBurmark (Member, Author) commented

@rhornung67 I added the division.

@MrBurmark (Member, Author) commented Jan 5, 2026

Thinking about the checksum a bit more and how to make it independent of the number of passes: how about we keep a sum of all the checksums, plus the min and max? Then we could compute the largest difference from the mean reference checksum at the end. That would avoid downplaying errors that occurred in only one pass.
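
A small sketch of what that bookkeeping might look like (names are illustrative, not the actual implementation):

#include <algorithm>
#include <cmath>
#include <limits>

struct ChecksumStats {
  double sum = 0.0;
  double min = std::numeric_limits<double>::max();
  double max = std::numeric_limits<double>::lowest();
  int count = 0;

  void update(double checksum) {  // call once per pass
    sum += checksum;
    min = std::min(min, checksum);
    max = std::max(max, checksum);
    ++count;
  }

  // Largest deviation of any single pass from a reference value, e.g.
  // the mean checksum of the reference variant tuning; this catches an
  // error that occurred in only one pass instead of averaging it away.
  double maxDiffFrom(double reference) const {
    return std::max(std::fabs(max - reference), std::fabs(min - reference));
  }
};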

@MrBurmark (Member, Author) commented

Regarding checksums and npasses: when deciding on a reference checksum, I chose to use the min checksum of the reference variant tuning for consistent kernels, since the checksums should all be the same anyway, and the average checksum of the reference variant tuning for other kernels. Would it be simpler if I just used the first checksum of the reference variant tuning?

@MrBurmark (Member, Author) commented

I went ahead and switched to use the first pass of the kernel as the reference checksum.

@MrBurmark MrBurmark enabled auto-merge January 7, 2026 17:59
@MrBurmark MrBurmark merged commit ae8a4b7 into develop Jan 7, 2026
17 checks passed
@rhornung67 rhornung67 deleted the feature/burmark1/correctness branch January 7, 2026 18:26