-
Notifications
You must be signed in to change notification settings - Fork 91
duc - Pipeline PhasedAccu for n>8 and allow non-log(2) values of n when n>8 #170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Looks good to me, nice to see tests extended as well. |
@@ -205,11 +205,13 @@ def __init__(self, width, constants): | |||
|
|||
### | |||
|
|||
self.latency = 2 if n > 8 else 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latency should be 0 for comb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what happened here is that I wrote the latency of the PhasedAccu in here rather than just the MCM. I think that results in the following:
PhasedAccu latency 0 for n<=8, 2 for n>8
MCM latency 0 for n<=8, 1 for n>8
I think it makes sense to give both classes a latency attribute so this is clear to users even if there's currently no code checking for latency attributes in these classes.
@@ -223,11 +225,25 @@ def __init__(self, width, constants): | |||
if n > 5: | |||
ctx += o[5].eq(i + (i << 2)) | |||
if n > 6: | |||
ctx += o[6].eq(o[3] << 1) | |||
ctx += o[6].eq((i << 2) + (i << 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this actually make a difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes if the context is sync
.
if n > 14: | ||
ctx += o[14].eq((i << 3) + (i << 2) + (i << 1)) | ||
if n > 15: | ||
ctx += o[15].eq(i + (i << 3) + (i << 2) + (i << 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the entire list be generated programmatically?
Also I'd have thought Vivado would do these optimizations already if you simply write *3, *4, *5, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skimmed through docs and haven't found a definite answer. Even if so, I wouldn't want to risk Vivado messing it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's easy to check, and issues would fail timing closure, and we rely on vivado for more complicated things than that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point; replaced it with a simple loop with *x
syntax, TNS fails severely:
------------------------------------------------------------------------------------------------
| Design Timing Summary
| ---------------------
------------------------------------------------------------------------------------------------
WNS(ns) TNS(ns) TNS Failing Endpoints TNS Total Endpoints WHS(ns) THS(ns) THS Failing Endpoints THS Total Endpoints WPWS(ns) TPWS(ns) TPWS Failing Endpoints TPWS Total Endpoints
------- ------- --------------------- ------------------- ------- ------- --------------------- ------------------- -------- -------- ---------------------- --------------------
-0.651 -325.990 1640 72371 0.042 0.000 0 72364 -0.259 -4.802 34 33075
Timing constraints are not met.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What design is that? Take a look at the netlist as well to confirm what Vivado did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this could use a code comment explaining the situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that vivado isn't smart enough to use shift and add operations when optimizing multiplication. Instead, it uses a full multiplier and fails timing as a consequence.
One option I considered was to add a "use_dsp" synthesis attribute to use DSP slices which can meet timing even for larger n, especially on newer FPGAs. At the end, though, I went back to shift and add to stick with the principle that is already in use.
A way to generate the list programmatically is to use Canonical Signed Digit (CSD) representations to find the optimal decomposition. If this is of interest I'll be happy to create a small n multiplier based on CSD for review. Here's a link with some more info:
https://www.allaboutcircuits.com/technical-articles/an-introduction-to-canonical-signed-digit-representation/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the netlist and it does use a DSP48E1 slice if you just use "*". It does meet timing, with a caveat.
Regardless of method used for MCM, EFC for LTC synthesis fails to meet timing at few points with 125MHz clock (OK on 100MHz). I was looking at the 125MHz output with the initial report - in this case, using the DSP made the timing issues worse. Regardless, this should be fine and the timing failure is in another spot in ARTIQ side that I will fix next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds wasteful to allocate DSP slices for those multiplications anyway.
Are these modifications compatible with https://github.com/quartiq/phaser ? |
Phaser gateware is fixed on early 2021 misoc/migen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as on #172
This patch was added with the Artiq EFC LTC2K variant in order to get timing closure for n>8. We also needed non-log(2) phases so that requirement is also lifted with the patch.