Correctness Checking #608
rhornung67
left a comment
Either here or in another PR, it may be a good idea to note in the Dev Guide some things to think about when setting the tolerance for a kernel in the code. Maybe a brief explanation of how you determined the tolerance to set for a couple of representative kernels with different tolerances.
We've tried to deal with growing checksums by adding a multiplier to keep their magnitude reasonable. We also use a Kahan sum approach for summing checksum values. We haven't looked at this in a while, or at least I haven't. It may be worth re-investigating since problem sizes are getting larger.
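For readers unfamiliar with the two ideas mentioned here, a scaled, compensated (Kahan) sum can be sketched as follows. This is only an illustration: the function name `kahanChecksum`, its signature, and the use of `long double` are assumptions, not the suite's actual checksum routine.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not the suite's API): Kahan-compensated sum of
// `data`, with every term multiplied by `scale` before accumulation so
// the result stays at a reasonable magnitude for large arrays.
long double kahanChecksum(const std::vector<double>& data, long double scale)
{
  long double sum = 0.0L;
  long double comp = 0.0L;  // low-order bits lost by the previous addition

  for (std::size_t i = 0; i < data.size(); ++i) {
    // Subtract the stored error from the incoming (scaled) term.
    long double term = scale * static_cast<long double>(data[i]) - comp;
    long double next = sum + term;
    // (next - sum) is what was actually added; the difference from
    // `term` is the rounding error carried into the next iteration.
    comp = (next - sum) - term;
    sum = next;
  }
  return sum;
}
```

Note that aggressive floating-point optimization flags such as `-ffast-math` may allow the compiler to simplify the compensation term away, which would reduce this to a plain sum.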
Do you mean multiple passes over the suite, or the number of times the kernel is invoked?
The checksums are added up over passes as well, and the checksum scaling factor does not account for that effect.
SetChecksumTolerance in all kernels for consistency.
I updated the documentation for checksum consistency and added some documentation for the checksum tolerance and scaling factor. @rhornung67
Do not print the whole checksum.
@MrBurmark what do you think about dividing the checksum increment by the number of passes (or multiplying by the floating-point value 1.0/npasses) each time the checksum is updated for each kernel? This could be hidden in the KernelBase class method, doesn't add much additional arithmetic, and would guarantee that the checksums for a run always have the same magnitude regardless of the number of passes.
@rhornung67 Yeah, we could do something like that. How about dividing by the number of times the kernel was run in the getChecksum method?
@rhornung67 I added the division.
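The division discussed in the preceding comments could look roughly like the sketch below. It assumes a hypothetical `ChecksumAccumulator` class; the real `KernelBase` members are not shown in this thread, so `addPass`, `m_sum`, and `m_num_runs` are illustrative names only.

```cpp
// Hypothetical stand-in for a kernel's checksum bookkeeping
// (not the actual KernelBase implementation).
class ChecksumAccumulator
{
public:
  // Called once per kernel run with that run's checksum.
  void addPass(long double pass_checksum)
  {
    m_sum += pass_checksum;
    ++m_num_runs;
  }

  // Divide by the number of runs so the reported magnitude does not
  // grow with the number of passes. Returns 0 if nothing was recorded.
  long double getChecksum() const
  {
    if (m_num_runs == 0) {
      return 0.0L;
    }
    return m_sum / static_cast<long double>(m_num_runs);
  }

private:
  long double m_sum = 0.0L;
  unsigned long m_num_runs = 0;
};
```

Doing the division once in the getter, rather than on every update, keeps the per-pass cost at a single addition.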
Thinking about the checksum a bit more and how to make it independent of the number of passes: how about we keep a sum of all the checksums, plus the min and max? Then we could get the largest difference from the mean reference checksum at the end. That would avoid downplaying errors that occurred in only one pass.
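To make the sum/min/max proposal concrete, here is a minimal sketch. The struct `ChecksumStats` and its member names are assumptions made for illustration; the point is that the largest deviation from a reference is always attained at either the minimum or the maximum recorded checksum, so those two values are enough to expose an error from a single pass.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical per-kernel statistics over all passes (illustrative only).
struct ChecksumStats
{
  long double sum = 0.0L;
  long double min = std::numeric_limits<long double>::max();
  long double max = std::numeric_limits<long double>::lowest();
  unsigned long count = 0;

  void add(long double checksum)
  {
    sum += checksum;
    min = std::min(min, checksum);
    max = std::max(max, checksum);
    ++count;
  }

  long double mean() const
  {
    return (count > 0) ? sum / static_cast<long double>(count) : 0.0L;
  }

  // Largest |checksum - reference| over every recorded pass.
  long double maxAbsDiff(long double reference) const
  {
    if (count == 0) {
      return 0.0L;
    }
    return std::max(std::fabs(max - reference),
                    std::fabs(min - reference));
  }
};
```

With passes of 1.0, 1.0, and 1.5 against a reference of 1.0, the mean differs from the reference by only about 0.17, while `maxAbsDiff` reports the full 0.5 from the one bad pass.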
Regarding checksums and npasses: when deciding on a reference checksum, I chose the min checksum of the reference variant tuning for consistent kernels, since those checksums should all be the same anyway, and the average checksum of the reference variant tuning for other kernels. Would it be simpler if I just used the first checksum of the reference variant tuning?
I went ahead and switched to use the first pass of the kernel as the reference checksum.
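The three reference choices weighed in this exchange (first pass, minimum, mean) can be compared side by side in a short sketch. The enum `ReferencePolicy` and the function `referenceChecksum` are hypothetical names introduced for illustration; they assume the per-pass checksums of the reference variant tuning are available as a vector.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical policies for picking a reference checksum.
enum class ReferencePolicy { FirstPass, Minimum, Mean };

// Select a reference from the per-pass checksums of the reference
// variant tuning. Requires at least one recorded pass.
long double referenceChecksum(const std::vector<long double>& per_pass,
                              ReferencePolicy policy)
{
  assert(!per_pass.empty());
  switch (policy) {
    case ReferencePolicy::Minimum:
      return *std::min_element(per_pass.begin(), per_pass.end());
    case ReferencePolicy::Mean:
      return std::accumulate(per_pass.begin(), per_pass.end(), 0.0L) /
             static_cast<long double>(per_pass.size());
    case ReferencePolicy::FirstPass:
      break;
  }
  return per_pass.front();  // first pass: no dependence on later passes
}
```

Using the first pass has the practical advantage that the reference is known as soon as one pass completes and does not change as more passes are run.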
Summary
Add more support for correctness checking.
Each kernel can now set its own tolerance, and we print whether the checksum met that tolerance in the show-progress screen output and in the checksum output file.
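As a rough illustration of a per-kernel tolerance check, consider the sketch below. It assumes the tolerance is applied to the absolute difference between a checksum and the reference; the summary does not say whether the real comparison is absolute or relative, and `checksumWithinTolerance` is a hypothetical name.

```cpp
#include <cmath>

// Hypothetical check (assumes an absolute-difference tolerance):
// returns true when `checksum` is within `tolerance` of `reference`.
// A NaN checksum fails, because any comparison with NaN is false.
bool checksumWithinTolerance(long double checksum,
                             long double reference,
                             long double tolerance)
{
  return std::fabs(checksum - reference) <= tolerance;
}
```

A kernel whose results are expected to be reproducible across variants could use a tolerance of zero, while kernels with order-dependent floating-point reductions would need a looser bound.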
This mechanism is now also used by the test executable.
How concerned should we be that checksums grow over multiple passes?
Example Output
Here is an example of -sp screen output.
Here is an example of checksum.txt file output.