duc - Pipeline PhasedAccu for n>8 and allow non-log(2) values of n when n>8 #170
base: master
@@ -205,11 +205,13 @@ def __init__(self, width, constants):

         ###

+        self.latency = 2 if n > 8 else 1
+
         # TODO: improve MCM
-        assert n <= 9
+        assert n <= 16
         assert range(n) == constants

-        ctx = self.comb
+        ctx = self.sync if n > 8 else self.comb
         if n > 0:
             ctx += o[0].eq(0)
         if n > 1:
@@ -223,11 +225,25 @@ def __init__(self, width, constants):
         if n > 5:
             ctx += o[5].eq(i + (i << 2))
         if n > 6:
-            ctx += o[6].eq(o[3] << 1)
+            ctx += o[6].eq((i << 2) + (i << 1))

Comment: Does this actually make a difference?
Reply: Yes, if the context is self.sync.

         if n > 7:
             ctx += o[7].eq((i << 3) - i)
         if n > 8:
             ctx += o[8].eq(i << 3)
+        if n > 9:
+            ctx += o[9].eq(i + (i << 3))
+        if n > 10:
+            ctx += o[10].eq((i << 3) + (i << 1))
+        if n > 11:
+            ctx += o[11].eq(i + (i << 3) + (i << 1))
+        if n > 12:
+            ctx += o[12].eq((i << 3) + (i << 2))
+        if n > 13:
+            ctx += o[13].eq(i + (i << 3) + (i << 2))
+        if n > 14:
+            ctx += o[14].eq((i << 3) + (i << 2) + (i << 1))
+        if n > 15:
+            ctx += o[15].eq(i + (i << 3) + (i << 2) + (i << 1))

Comment: Can the entire list be generated programmatically? Also, I'd have thought Vivado would do these optimizations already if you simply write *3, *4, *5, etc.
Reply: Skimmed through the docs and haven't found a definite answer. Even if so, I wouldn't want to risk Vivado messing it up.
Comment: It's easy to check, issues would fail timing closure, and we rely on Vivado for more complicated things than that.
Reply: Good point; replaced it with a simple loop with multiplication.

Comment: What design is that? Take a look at the netlist as well to confirm what Vivado did.
Comment: And this could use a code comment explaining the situation.
Reply: I think Vivado isn't smart enough to use shift-and-add operations when optimizing multiplication. Instead, it uses a full multiplier and fails timing as a consequence. One option I considered was to add a "use_dsp" synthesis attribute to use DSP slices, which can meet timing even for larger n, especially on newer FPGAs. In the end, though, I went back to shift-and-add to stick with the principle that is already in use. A way to generate the list programmatically is to use Canonical Signed Digit (CSD) representations to find the optimal decomposition. If this is of interest, I'll be happy to create a small-n multiplier based on CSD for review. Here's a link with some more info:
Reply: I checked the netlist, and it does use a DSP48E1 slice if you just use "*". It does meet timing, with a caveat: regardless of the method used for the MCM, EFC for LTC synthesis fails to meet timing at a few points with a 125 MHz clock (OK at 100 MHz). I was looking at the 125 MHz output with the initial report; in this case, using the DSP made the timing issues worse. Regardless, this should be fine, and the timing failure is in another spot on the ARTIQ side that I will fix next.
Comment: It sounds wasteful to allocate DSP slices for those multiplications anyway.



 class PhasedAccu(Module):
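On the question of generating the list programmatically: a Canonical Signed Digit decomposition can be computed in a few lines of plain Python. The following is only a sketch of the idea mentioned in the thread, not code from this PR or from misoc; the helpers csd_digits and shift_add_expr are invented names, and they produce textual expressions rather than Migen statements. CSD has no two adjacent non-zero digits, so for example 7 comes out as (i << 3) - i and 15 as (i << 4) - i.

def csd_digits(c):
    """Canonical Signed Digit form of a non-negative constant c,
    returned as (bit_position, digit) pairs with digit in {-1, +1}."""
    digits = []
    k = 0
    while c:
        if c & 1:
            d = 2 - (c & 3)  # +1 if c % 4 == 1, -1 if c % 4 == 3
            digits.append((k, d))
            c -= d
        c >>= 1
        k += 1
    return digits


def shift_add_expr(c, var="i"):
    """Render c*var as shifts, adds and subtracts, e.g. 7 -> (i << 3) - i."""
    digits = sorted(csd_digits(c), reverse=True)
    if not digits:
        return "0"
    expr = ""
    for k, d in digits:
        term = var if k == 0 else "({} << {})".format(var, k)
        if not expr:
            expr = term  # the leading CSD digit of a positive constant is +1
        else:
            expr += (" + " if d > 0 else " - ") + term
    return expr


if __name__ == "__main__":
    for c in range(16):
        print("o[{}] = {}".format(c, shift_add_expr(c)))

In an actual MCM the (position, digit) pairs could be turned directly into Migen add/subtract expressions instead of strings, replacing the hand-written if-ladder above.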
@@ -247,26 +263,53 @@ def __init__(self, n, fwidth, pwidth):
         self.z = [Signal(pwidth, reset_less=True)
                   for _ in range(n)]

-        self.submodules.mcm = MCM(fwidth, range(n))
+        self.submodules.mcm = MCM(fwidth, range(n+1))
         # reset by clr
         qa = Signal(fwidth, reset_less=True)
         qb = Signal(fwidth, reset_less=True)
         clr_d = Signal(reset_less=True)
-        self.sync += [
-            clr_d.eq(self.clr),
-            qa.eq(qa + (self.f << log2_int(n))),
-            self.mcm.i.eq(self.f),
-            If(self.clr | clr_d,
-                qa.eq(0),
-            ),
-            If(clr_d,
-                self.mcm.i.eq(0),
-            ),
-            qb.eq(qa + (self.p << fwidth - pwidth)),
-            [z.eq((qb + oi)[fwidth - pwidth:])
-                for oi, z in zip(self.mcm.o, self.z)]
-        ]
+        if n > 8:
+            # additional pipelining for n > 8
+            clr_d2 = Signal(reset_less=True)
+            mcm_o_d = [Signal(fwidth, reset_less=True) for _ in range(n)]
+            self.sync += [
+                # Delay signals to match the now increased MCM latency
+                clr_d.eq(self.clr),
+                clr_d2.eq(clr_d),
+                [mcm_o_d[i].eq(self.mcm.o[i]) for i in range(n)],
+
+                qa.eq(qa + self.mcm.o[n]),
+                self.mcm.i.eq(self.f),
+                If(clr_d | clr_d2,
+                    qa.eq(0),
+                ),
+                If(clr_d2,
+                    self.mcm.i.eq(0),
+                ),
+                qb.eq(qa + (self.p << (fwidth - pwidth))),
+
+                # Use delayed signals in the final phase calculation
+                [z.eq((qb + mcm_o_d[i])[fwidth - pwidth:])
+                 for i, z in enumerate(self.z)]
+            ]
+        else:
+            self.sync += [
+                clr_d.eq(self.clr),
+                qa.eq(qa + (self.f << log2_int(n))),
+                self.mcm.i.eq(self.f),
+                If(self.clr | clr_d,
+                    qa.eq(0),
+                ),
+                If(clr_d,
+                    self.mcm.i.eq(0),
+                ),
+                qb.eq(qa + (self.p << (fwidth - pwidth))),
+
+                # Use non-delayed signals in the final phase calculation
+                [z.eq((qb + oi)[fwidth - pwidth:])
+                 for oi, z in zip(self.mcm.o, self.z)]
+            ]


 class PhaseModulator(Module):
     """Complex phase modulator/shifter.
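For reference, what both branches above must compute per system-clock cycle is n phase samples z[k] = (q + (p << (fwidth - pwidth)) + k*f) >> (fwidth - pwidth), with the accumulator q advancing by n*f each cycle. For power-of-two n that increment is simply f << log2_int(n); for other n it needs a constant multiplication, which is why the MCM is now instantiated with range(n+1) and qa adds self.mcm.o[n] in the n > 8 path. The following plain-Python model is a sketch of that arithmetic only (no clr handling, no pipeline latency); it is not simulation code from misoc, and FWIDTH/PWIDTH are assumed example widths.

FWIDTH = 32  # assumed accumulator width for this sketch
PWIDTH = 16  # assumed output phase width


def phased_accu_model(f, p, n, cycles):
    """Ideal (zero-latency) model of the per-cycle outputs:
    z[k] = (q + (p << (FWIDTH - PWIDTH)) + k*f) >> (FWIDTH - PWIDTH),
    with the accumulator q advancing by n*f every cycle."""
    mask = (1 << FWIDTH) - 1
    q = 0
    out = []
    for _ in range(cycles):
        base = (q + (p << (FWIDTH - PWIDTH))) & mask
        out.append([((base + k * f) & mask) >> (FWIDTH - PWIDTH)
                    for k in range(n)])
        q = (q + n * f) & mask  # works for any n, not just powers of two
    return out


if __name__ == "__main__":
    # n = 10 is not a power of two; with this PR it is allowed since n > 8
    for phases in phased_accu_model(f=0x01234567, p=0x3, n=10, cycles=3):
        print([hex(z) for z in phases])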
Comment: Latency should be 0 for comb.
Reply: I think what happened here is that I wrote the latency of the PhasedAccu here rather than just that of the MCM. I think that results in the following:
PhasedAccu latency: 0 for n <= 8, 2 for n > 8
MCM latency: 0 for n <= 8, 1 for n > 8
I think it makes sense to give both classes a latency attribute so this is clear to users, even if there is currently no code checking for latency attributes in these classes.
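A sketch of what the suggested latency attributes could look like, written as plain-Python stand-ins rather than the actual Migen modules. The MCM figures (0 combinational, 1 registered) come from the thread above; reading the PhasedAccu figures as additional latency on top of the original design is an interpretation, and base_latency is an assumed placeholder, not a value from the code.

class McmLatency:
    def __init__(self, n):
        # combinational for n <= 8, one output register stage for n > 8
        self.latency = 1 if n > 8 else 0


class PhasedAccuLatency:
    def __init__(self, n, base_latency=2):
        # base_latency: assumed latency of the original accumulator;
        # two extra cycles appear only when the MCM path is registered
        self.mcm_latency = McmLatency(n).latency
        self.latency = base_latency + (2 if n > 8 else 0)


if __name__ == "__main__":
    for n in (4, 8, 10, 16):
        accu = PhasedAccuLatency(n)
        print("n={:2d}: MCM latency {}, PhasedAccu latency {}".format(
            n, accu.mcm_latency, accu.latency))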