duc - Pipeline PhasedAccu for n>8 and allow non-log(2) values of n when n>8 #170

Status: Open. Wants to merge 1 commit into base: master.
79 changes: 61 additions & 18 deletions misoc/cores/duc.py
@@ -205,11 +205,13 @@ def __init__(self, width, constants):

###

self.latency = 2 if n > 8 else 1
Member

Latency should be 0 for comb.

Author

I think what happened here is that I wrote the latency of the whole PhasedAccu rather than just the MCM's. That would give:

PhasedAccu: latency 0 for n <= 8, 2 for n > 8
MCM: latency 0 for n <= 8, 1 for n > 8

I think it makes sense to give both classes a latency attribute so this is clear to users, even though no code currently checks these attributes.
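The proposal could look roughly like this (a minimal plain-Python sketch, not the actual misoc classes; the class names are stand-ins and the numbers simply encode the latencies listed above):

```python
class MCMSketch:
    """Stand-in for MCM: outputs are registered only when n > 8."""
    def __init__(self, n):
        self.latency = 1 if n > 8 else 0


class PhasedAccuSketch:
    """Stand-in for PhasedAccu: two extra stages appear when n > 8."""
    def __init__(self, n):
        self.latency = 2 if n > 8 else 0
```

Downstream code (e.g. testbenches) could then read `dut.latency` instead of hard-coding pipeline delays.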


# TODO: improve MCM
- assert n <= 9
+ assert n <= 16
assert range(n) == constants

- ctx = self.comb
+ ctx = self.sync if n > 8 else self.comb
if n > 0:
ctx += o[0].eq(0)
if n > 1:
Expand All @@ -223,11 +225,25 @@ def __init__(self, width, constants):
if n > 5:
ctx += o[5].eq(i + (i << 2))
if n > 6:
- ctx += o[6].eq(o[3] << 1)
+ ctx += o[6].eq((i << 2) + (i << 1))
Member

Does this actually make a difference?

Contributor

Yes, if the context is sync: o[3] is then a registered output, so o[6].eq(o[3] << 1) would use the previous cycle's value.
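To illustrate with a toy Python model (not Migen; a dict of "registers" updated once per clock edge): with comb logic o[3] tracks i instantly, so reusing it is fine, but once the context is sync, o[3] holds last cycle's value and the reuse lags by one cycle:

```python
def step(i_now, regs):
    """One clock edge of a toy registered MCM with two o[6] variants."""
    return {
        "o3": 3 * i_now,               # registered from the current input
        "o6_from_i": 6 * i_now,        # derived from i: stays aligned
        "o6_from_o3": 2 * regs["o3"],  # reuses o3: one cycle stale
    }

regs = {"o3": 0, "o6_from_i": 0, "o6_from_o3": 0}
regs = step(1, regs)
regs = step(2, regs)
# after the second edge, o6_from_i tracks i=2 while o6_from_o3
# still reflects i=1 from the previous cycle
```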

if n > 7:
ctx += o[7].eq((i << 3) - i)
if n > 8:
ctx += o[8].eq(i << 3)
if n > 9:
ctx += o[9].eq(i + (i << 3))
if n > 10:
ctx += o[10].eq((i << 3) + (i << 1))
if n > 11:
ctx += o[11].eq(i + (i << 3) + (i << 1))
if n > 12:
ctx += o[12].eq((i << 3) + (i << 2))
if n > 13:
ctx += o[13].eq(i + (i << 3) + (i << 2))
if n > 14:
ctx += o[14].eq((i << 3) + (i << 2) + (i << 1))
if n > 15:
ctx += o[15].eq(i + (i << 3) + (i << 2) + (i << 1))
Member

Can the entire list be generated programmatically?

Also I'd have thought Vivado would do these optimizations already if you simply write *3, *4, *5, etc.
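For reference, a programmatic version could be sketched like this (plain Python rather than Migen, using a plain binary decomposition rather than an optimal one; the function names are made up for illustration):

```python
from functools import reduce
from operator import add

def shift_terms(k):
    """Indices of set bits of k; each becomes one 'i << j' term."""
    return [j for j in range(k.bit_length()) if (k >> j) & 1]

def times(k, i):
    """k * i built only from shifts and adds."""
    return reduce(add, (i << j for j in shift_terms(k)), 0)
```

Since Migen expressions overload `+` and `<<`, the same reduce over shifted copies of `i` could presumably drive `o[k].eq(...)` inside a loop over k.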

Contributor

I skimmed through the docs and haven't found a definite answer. Even if it does, I wouldn't want to risk Vivado getting it wrong.

Member

It's easy to check, any issues would show up as failed timing closure, and we rely on Vivado for more complicated things than that.

Contributor

Good point; I replaced it with a simple loop using the *x syntax, and TNS fails severely:

------------------------------------------------------------------------------------------------
| Design Timing Summary
| ---------------------
------------------------------------------------------------------------------------------------

    WNS(ns)      TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints      WHS(ns)      THS(ns)  THS Failing Endpoints  THS Total Endpoints     WPWS(ns)     TPWS(ns)  TPWS Failing Endpoints  TPWS Total Endpoints  
    -------      -------  ---------------------  -------------------      -------      -------  ---------------------  -------------------     --------     --------  ----------------------  --------------------  
     -0.651     -325.990                   1640                72371        0.042        0.000                      0                72364       -0.259       -4.802                      34                 33075  


Timing constraints are not met.

Member

What design is that? Take a look at the netlist as well to confirm what Vivado did.

Member

And this could use a code comment explaining the situation.

Author

I think Vivado isn't smart enough to use shift-and-add operations when optimizing multiplication. Instead, it infers a full multiplier and fails timing as a consequence.

One option I considered was adding a "use_dsp" synthesis attribute to use DSP slices, which can meet timing even for larger n, especially on newer FPGAs. In the end, though, I went back to shift-and-add to stick with the principle already in use.

One way to generate the list programmatically is to use Canonical Signed Digit (CSD) representations to find an optimal decomposition. If this is of interest I'll be happy to create a small CSD-based multiplier for review. Here's a link with some more info:
https://www.allaboutcircuits.com/technical-articles/an-introduction-to-canonical-signed-digit-representation/
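For concreteness, a minimal CSD encoder could look like this (a plain-Python sketch for non-negative constants, not part of the PR):

```python
def csd(value):
    """Canonical signed digit form of a non-negative integer: a list
    of digits in {-1, 0, 1}, least significant first, with no two
    adjacent non-zero digits."""
    digits = []
    while value:
        if value & 1:
            # pick +1 or -1 so the next bit becomes 0
            d = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits
```

For example, 7 encodes as 8 - 1 (two non-zero digits instead of three), so 7*i needs one shift and one subtractor instead of two adders.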

Contributor @Spaqin, Jul 23, 2025

I checked the netlist, and it does use a DSP48E1 slice if you just use "*". It does meet timing, with a caveat.

Regardless of the method used for the MCM, EFC for LTC synthesis fails to meet timing at a few points with a 125 MHz clock (it is OK at 100 MHz). I was looking at the 125 MHz output in the initial report; in that case, using the DSP made the timing issues worse. Regardless, this should be fine, and the remaining timing failure is in another spot on the ARTIQ side that I will fix next.

Member

It sounds wasteful to allocate DSP slices for those multiplications anyway.



class PhasedAccu(Module):
@@ -247,26 +263,53 @@ def __init__(self, n, fwidth, pwidth):
self.z = [Signal(pwidth, reset_less=True)
for _ in range(n)]

- self.submodules.mcm = MCM(fwidth, range(n))
+ self.submodules.mcm = MCM(fwidth, range(n+1))
# reset by clr
qa = Signal(fwidth, reset_less=True)
qb = Signal(fwidth, reset_less=True)
clr_d = Signal(reset_less=True)
- self.sync += [
- clr_d.eq(self.clr),
- qa.eq(qa + (self.f << log2_int(n))),
- self.mcm.i.eq(self.f),
- If(self.clr | clr_d,
- qa.eq(0),
- ),
- If(clr_d,
- self.mcm.i.eq(0),
- ),
- qb.eq(qa + (self.p << fwidth - pwidth)),
- [z.eq((qb + oi)[fwidth - pwidth:])
- for oi, z in zip(self.mcm.o, self.z)]
- ]

if n > 8:
# additional pipelining for n > 8
clr_d2 = Signal(reset_less=True)
mcm_o_d = [Signal(fwidth, reset_less=True) for _ in range(n)]
self.sync += [
# Delay signals to match now increased mcm latency
clr_d.eq(self.clr),
clr_d2.eq(clr_d),
[mcm_o_d[i].eq(self.mcm.o[i]) for i in range(n)],

qa.eq(qa + self.mcm.o[n]),
self.mcm.i.eq(self.f),
If(clr_d | clr_d2,
qa.eq(0),
),
If(clr_d2,
self.mcm.i.eq(0),
),
qb.eq(qa + (self.p << (fwidth - pwidth))),

# Use delayed signals in the final phase calculation
[z.eq((qb + mcm_o_d[i])[fwidth - pwidth:])
for i, z in enumerate(self.z)]
]
else:
self.sync += [
clr_d.eq(self.clr),
qa.eq(qa + (self.f << log2_int(n))),
self.mcm.i.eq(self.f),
If(self.clr | clr_d,
qa.eq(0),
),
If(clr_d,
self.mcm.i.eq(0),
),
qb.eq(qa + (self.p << (fwidth - pwidth))),

# Use non-delayed signals in the final phase calculation
[z.eq((qb + oi)[fwidth - pwidth:])
for oi, z in zip(self.mcm.o, self.z)]
]

class PhaseModulator(Module):
"""Complex phase modulator/shifter.
92 changes: 92 additions & 0 deletions misoc/test/test_duc.py
@@ -50,6 +50,7 @@ def test_init(self):

def test_seq(self):
def gen():
# latency = 2
yield self.dut.clr.eq(0)
yield self.dut.p.eq(0x01)
yield
@@ -104,6 +105,97 @@ def gen():
self.assertEqual((yield self.dut.z[1]), 0xa1)
run_simulation(self.dut, gen())

class TestPhasedAccuNonLog(unittest.TestCase):
def setUp(self):
self.dut = duc.PhasedAccu(n=12, fwidth=32, pwidth=16)

def test_init(self):
self.assertEqual(len(self.dut.f), 32)
self.assertEqual(len(self.dut.p), 16)
self.assertEqual(len(self.dut.z), 12)
self.assertEqual(len(self.dut.z[0]), 16)

def test_seq(self):
def gen():
# latency = 4
n=12
yield self.dut.clr.eq(0)
yield self.dut.p.eq(0x01)
yield
yield
yield
yield
# check phase offset with f=0
self.assertEqual((yield self.dut.z[0]), 0x01)
self.assertEqual((yield self.dut.z[1]), 0x01)
yield self.dut.f.eq(0x10 << 16)
yield
yield
yield
yield
yield
# check first cycle f increments
for i in range(n):
expected_value = (i << 4) | 0x01
self.assertEqual((yield self.dut.z[i]), expected_value)
yield
# second cycle f increments
for i in range(n):
expected_value += 0x10
self.assertEqual((yield self.dut.z[i]), expected_value)
yield self.dut.clr.eq(1)
yield
for i in range(n):
expected_value += 0x10
self.assertEqual((yield self.dut.z[i]), expected_value)
yield
for i in range(n):
expected_value += 0x10
self.assertEqual((yield self.dut.z[i]), expected_value)
yield self.dut.clr.eq(0)
yield
for i in range(n):
expected_value += 0x10
yield
for i in range(n):
expected_value += 0x10
self.assertEqual((yield self.dut.z[i]), expected_value)
yield
# first clr cycle
for i in range(n):
expected_value = (i << 4) | 0x01
self.assertEqual((yield self.dut.z[i]), expected_value)
yield
# second clr cycle
for i in range(n):
expected_value = (i << 4) | 0x01
self.assertEqual((yield self.dut.z[i]), expected_value)
yield self.dut.f.eq(0x20 << 16)
yield
yield
yield
# first cycle after clr with old f
for i in range(n):
expected_value = (i << 4) | 0x01
self.assertEqual((yield self.dut.z[i]), expected_value)
yield
# second cycle with old f
for i in range(n):
expected_value += 0x10
self.assertEqual((yield self.dut.z[i]), expected_value)
yield
# cycle with one old and one new
expected_value += 0x10
for i in range(n):
self.assertEqual((yield self.dut.z[i]), expected_value)
expected_value += 0x20
yield
# cycle with only new increments
expected_value -= 0x20
for i in range(n):
expected_value += 0x20
self.assertEqual((yield self.dut.z[i]), expected_value)
run_simulation(self.dut, gen())

class TestMul(unittest.TestCase):
def setUp(self):