CCT Attention Training on Siracusa #69
base: devel
Conversation
📝 Walkthrough
This PR adds support for GELU and LayerNorm gradient operations throughout the Deeploy compiler stack, including parsers, layers, tile constraints, templates, and kernel implementations. It also includes optimizations to GEMM (optional bias handling) and SGD (loop unrolling), updates CI configurations for Siracusa platforms, and refines transpose node naming in topology optimization passes.
Sequence Diagram(s)
sequenceDiagram
participant Compiler as Deeploy Compiler
participant Parser as GELUGrad/LayerNormGrad<br/>Parser
participant Layer as Gradient Layer
participant Mapper as Tiling Mapper
participant TileConstraint as Tile Constraint
participant Template as Gradient Template
participant Kernel as Target Kernel
Compiler->>Parser: Parse gradient node (GELUGrad/LayerNormGrad)
Parser->>Parser: Extract inputs (grad_in, data_in, etc.)
Parser->>Layer: Create Layer instance
Layer->>Mapper: Apply mapper with parsed operatorRepresentation
Mapper->>TileConstraint: Initialize tile constraint
TileConstraint->>TileConstraint: Add geometrical/policy constraints
TileConstraint->>Template: Serialize tiling solution
Template->>Template: Generate parallel chunk-based code
Template->>Kernel: Invoke gradient kernel (chunk range)
Kernel->>Kernel: Compute per-element gradient
Kernel->>Compiler: Return grad_out
sequenceDiagram
participant Old as Old SGD/GEMM
participant Codegen as Code Generator
participant Unroll as Unrolled Implementation
participant Kernel as Optimized Kernel
Old->>Codegen: Simple per-element loop
Codegen->>Codegen: Single operation per iteration
Codegen->>Kernel: Generate baseline code
alt New SGD/GEMM Path
Codegen->>Unroll: Extract chunks (6-wide blocks)
Unroll->>Unroll: Batch multiply (6 elements)
Unroll->>Unroll: Batch update weights/accumulate
Unroll->>Kernel: Process tail elements
Kernel->>Codegen: Optimized execution
end
alt GEMM Optional Bias
Codegen->>Unroll: Check has_bias flag
alt Bias present
Unroll->>Kernel: Add per-row bias
else No bias
Unroll->>Kernel: Pass NULL for C
end
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 3
🧹 Nitpick comments (8)
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1)
26-42: Unrolling factor of 6 is functional but unconventional. The loop unrolling works correctly. However, an unroll factor of 4 or 8 typically aligns better with SIMD widths and cache lines on most architectures. If performance is critical, consider benchmarking with power-of-2 unroll factors.
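For comparison, here is a minimal, self-contained sketch of the same update pattern with a power-of-2 unroll factor and an explicit tail loop. The function name and signature are illustrative assumptions, not code taken from the Deeploy/PULP sources:

```c
#include <stdint.h>

/* Hypothetical SGD step: w[i] -= lr * g[i], unrolled by 4 with a scalar tail.
 * This only illustrates the unroll/tail pattern discussed above. */
static void sgd_step_unroll4(float *w, const float *g, float lr, int32_t size) {
    int32_t i = 0;
    const int32_t main_end = size - (size % 4); /* last index covered by 4-wide blocks */
    for (; i < main_end; i += 4) {
        w[i]     -= lr * g[i];
        w[i + 1] -= lr * g[i + 1];
        w[i + 2] -= lr * g[i + 2];
        w[i + 3] -= lr * g[i + 3];
    }
    for (; i < size; ++i) { /* tail elements (size % 4 of them) */
        w[i] -= lr * g[i];
    }
}
```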
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)
355-360: Consider adding strict=True to zip() calls for robustness. While the lists are constructed together and should always match in length, adding strict=True (Python 3.10+) would catch any future bugs where list lengths diverge. This is optional given the current structure guarantees equal lengths.
 if has_bias:
-    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes):
+    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes, strict=True):
         inputLoadSchedule.append({"A": a, "B": b, "C": c})
 else:
-    for a, b in zip(inputACubes, inputBCubes):
+    for a, b in zip(inputACubes, inputBCubes, strict=True):
         inputLoadSchedule.append({"A": a, "B": b})
Deeploy/Targets/Generic/Layers.py (1)
453-457: Consider adding a computeOps() method for consistency. The LayerNormGradLayer class doesn't implement computeOps(), while other gradient layers like GELUGradLayer (line 66) and SoftmaxGradLayer (line 134) do. For consistency and accurate operation counting in analysis workflows, consider adding this method.
Would you like me to help estimate the operation count for LayerNorm gradient based on the kernel implementation in TargetLibraries/Generic/src/Layernorm_fp32.c?
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)
113-121: Rename unused loop variables. The loop variable dim is not used in the loop body on lines 113-117 and 118-121. Following Python conventions, rename it to _dim or simply remove it if only the index is needed. Apply this diff:
-    for idx, dim in enumerate(input_shape):
+    for idx, _ in enumerate(input_shape):
         tilerModel.addConstraint(
             tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
             tilerModel.getTensorDimVar(tensorName = grad_in_buffer_name, dimIdx = idx))
-    for idx, dim in enumerate(input_shape):
+    for idx, _ in enumerate(input_shape):
         tilerModel.addConstraint(
             tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
             tilerModel.getTensorDimVar(tensorName = grad_out_buffer_name, dimIdx = idx))
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
40-43: Unused bias parameter. The bias parameter is declared but never used in the function body. LayerNorm gradient with respect to input typically doesn't require the bias term, but the signature inconsistency with the forward pass may cause confusion. Consider either:
- Documenting why bias is intentionally unused (for API consistency with the forward pass), or
- Removing it if not needed for gradient computation:
-void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
-                             float32_t *grad_out, float32_t *scale,
-                             float32_t *bias, float32_t epsilon, int32_t size,
-                             int32_t lastDimLength) {
+void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
+                             float32_t *grad_out, float32_t *scale,
+                             float32_t epsilon, int32_t size,
+                             int32_t lastDimLength) {
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (1)
16-17: Return type annotation inconsistency. The method signature declares Tuple[NetworkContext, Dict, List[str]] but should return Tuple[NetworkContext, OperatorRepresentation, List[str]] to match the base class NodeTemplate.alignToContext signature shown in the relevant snippets.
 def alignToContext(self, ctxt: NetworkContext,
-                   operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, Dict, List[str]]:
+                   operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]:
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (2)
49-59: Consider extracting axis normalization into a helper method. The negative axis normalization logic is duplicated across addGeometricalConstraint (lines 54-59 and 80-86), constructSymbolicNodeRep (lines 154-158), and serializeTilingSolution (lines 210-216). Extracting this into a shared static helper would reduce duplication and ensure consistent behavior.
    @staticmethod
    def _normalize_axes(axis, input_shape_len: int) -> List[int]:
        """Normalize axis indices, handling negative values."""
        if isinstance(axis, int):
            axis = [axis]
        normalized = []
        for ax in axis:
            if ax < 0:
                ax = input_shape_len + ax
            normalized.append(ax)
        return normalized
Also applies to: 75-86, 146-158, 210-216
237-238: Move import to top of file. The HyperRectangle import should be placed at the top of the file with other imports from the same module, unless there's a specific reason to avoid it (e.g., circular dependency).
 from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import HyperRectangle
Or combine into a single import:
-from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, HyperRectangle, TilingSchedule, VariableReplacementScheme
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- .github/workflows/ci-platform-siracusa-tiled.yml (3 hunks)
- .github/workflows/ci-platform-siracusa.yml (0 hunks)
- Deeploy/Targets/Generic/Layers.py (3 hunks)
- Deeploy/Targets/Generic/Parsers.py (2 hunks)
- Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1 hunks)
- Deeploy/Targets/PULPOpen/Bindings.py (2 hunks)
- Deeploy/Targets/PULPOpen/Platform.py (5 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (5 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/Tiler.py (3 hunks)
- Deeploy/TilingExtension/TilingCodegen.py (1 hunks)
- TargetLibraries/Generic/inc/kernel/GELU.h (1 hunks)
- TargetLibraries/Generic/inc/kernel/Layernorm.h (1 hunks)
- TargetLibraries/Generic/src/GELU_fp32.c (1 hunks)
- TargetLibraries/Generic/src/Layernorm_fp32.c (1 hunks)
- TargetLibraries/PULPOpen/src/Gemm.c (1 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/ci-platform-siracusa.yml
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
📚 Learning: 2025-09-09T15:43:20.195Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.
Applied to files:
- Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py
- Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
🧬 Code graph analysis (10)
TargetLibraries/Generic/inc/kernel/GELU.h (1)
TargetLibraries/Generic/src/GELU_fp32.c (1)
GELU_fp32_fp32_sigmoid_grad_chunk(34-53)
TargetLibraries/Generic/inc/kernel/Layernorm.h (1)
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
LayernormGrad_fp32_fp32(40-93)
Deeploy/Targets/Generic/Layers.py (1)
Deeploy/DeeployTypes.py (2)
ONNXLayer (1819-2147)
computeOps (1835-1840)
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (4)
Deeploy/DeeployTypes.py (1)
lookup (720-752)
Deeploy/TilingExtension/TilerModel.py (1)
getTensorDimVar (131-135)
Deeploy/TilingExtension/TilingCodegen.py (1)
HyperRectangle (24-35)
Deeploy/TilingExtension/TileConstraint.py (1)
extractBaseAddr(56-74)
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/Bindings.py (4)
Deeploy/DeeployTypes.py (1)
NodeBinding (1512-1657)
Deeploy/Targets/Generic/TypeCheckers.py (3)
AddChecker (88-102)
LayerNormChecker (198-209)
GELUChecker (388-402)
Deeploy/AbstractDataTypes.py (1)
PointerClass (536-559)
Deeploy/CommonExtensions/DataTypes.py (1)
float32_t(74-78)
Deeploy/Targets/PULPOpen/Tiler.py (4)
Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)
GeluGradTileConstraint (8-12)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (2)
LayernormTileConstraint (19-80)
LayernormGradTileConstraint (83-159)
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1)
ReduceSumTileConstraint (18-250)
Deeploy/TilingExtension/TilerExtension.py (1)
TilingReadyNodeBindings(1027-1035)
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3)
Deeploy/DeeployTypes.py (3)
NetworkContext (508-1020)
NodeTemplate (87-229)
alignToContext (119-139)
Deeploy/Targets/PULPOpen/Templates/ReshapeTemplate.py (1)
alignToContext (37-53)
Deeploy/Targets/PULPOpen/Templates/DMASliceTemplate.py (1)
alignToContext(18-138)
Deeploy/Targets/PULPOpen/Platform.py (3)
Deeploy/Targets/Generic/Layers.py (2)
GELUGradLayer (61-70)
LayerNormGradLayer (453-456)
Deeploy/Targets/Generic/Parsers.py (3)
GELUGradParser (773-797)
LayerNormGradParser (1677-1704)
LayerNormParser (1647-1674)
Deeploy/DeeployTypes.py (1)
NodeMapper(1660-1816)
🪛 Ruff (0.14.5)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py
113-113: Loop control variable dim not used within loop body
Rename unused dim to _dim
(B007)
118-118: Loop control variable dim not used within loop body
Rename unused dim to _dim
(B007)
128-128: Unused class method argument: ctxt
(ARG003)
Deeploy/Targets/Generic/Parsers.py
786-786: Unused method argument: channels_first
(ARG002)
1691-1691: Unused method argument: channels_first
(ARG002)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
290-290: Local variable varB is assigned to but never used
Remove assignment to unused variable varB
(F841)
356-356: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
359-359: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py
71-71: Avoid specifying long messages outside the exception class
(TRY003)
92-93: Avoid specifying long messages outside the exception class
(TRY003)
127-127: Unused static method argument: parseDict
(ARG004)
127-127: Unused static method argument: ctxt
(ARG004)
133-133: Unused static method argument: tilerModel
(ARG004)
172-172: Unused class method argument: absoluteOutputCubes
(ARG003)
Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py
159-159: Unused class method argument: ctxt
(ARG003)
🔇 Additional comments (39)
.github/workflows/ci-platform-siracusa-tiled.yml (3)
137-138: Verify intentionality of L1 configuration simplification for CCT tests. The change reduces L1 memory configurations from multiple test values to a single value, which limits test coverage to one memory allocation strategy. According to the AI summary, this was [4000, 64000] and is now [64000]. Confirm this reduction is intentional and doesn't mask hardware-specific issues that only surface with smaller memory allocations.
168-171: Verify consistency of new test addition across all job configurations. A new test entry testTrainCCT/CCT2_FT2 is added to the L3 jobs with L1: [128000]. Per the AI summary, this test should also be added to the singlebuffer-L2 configuration (line 138 area). Confirm whether this is intentional or if the test entry is missing from singlebuffer-L2 and other jobs.
208-211: Verify consistency of L1 simplification across doublebuffer-L3. Changes here mirror those in singlebuffer-L3 (lines 169-171), which is good for consistency. However, apply the same verification from the previous comment to ensure testTrainCCT/CCT2_FT2 is correctly scoped to only L3 jobs and not missing from other configurations.
Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1)
679-680: LGTM! Naming collision fix improves correctness. The updated naming scheme correctly prevents collisions when multiple transpose nodes feed into the same (or identically-named) downstream nodes. By prefixing with t1.name, each split transpose gets a unique identifier derived from both the source transpose and its consumer.
Deeploy/TilingExtension/TilingCodegen.py (1)
34-35: LGTM! Defensive tuple coercion improves robustness. The normalization ensures offset and dims are always tuples internally, which is useful when callers pass lists or other sequences. Consider updating the type hints from Tuple[int, ...] to Sequence[int] to reflect the accepted input types more accurately.
TargetLibraries/Generic/inc/kernel/GELU.h (1)
28-30: LGTM! Function declaration is consistent with implementation. The new gradient chunk function declaration aligns with the implementation in GELU_fp32.c and follows the existing naming conventions in this header.
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)
200-210: LGTM! Optional bias handling is correctly implemented. The conditional bias detection and buffer handling ensures both biased and unbiased GEMM configurations are properly supported. The pattern is consistently applied in both addGeometricalConstraint and serializeTilingSolution.
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
40-51: LGTM! Kernel call with proper guard condition. The check for elem_count > 0 correctly prevents calling the kernel with zero elements, which could occur for cores that have no work assigned when the sequence count is less than the core count.
Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)
8-12: LGTM! The GeluGradTileConstraint class correctly extends BOPTileConstraint with appropriate gradient-specific tensor names (grad_in, data_in, grad_out), following the established pattern for binary operation tile constraints.
TargetLibraries/Generic/inc/kernel/Layernorm.h (1)
28-31: LGTM! The function declaration for LayernormGrad_fp32_fp32 is correctly added, with a signature that mirrors the forward pass and properly includes gradient-specific parameters (grad_in, grad_out). The implementation exists in the corresponding .c file.
TargetLibraries/Generic/src/GELU_fp32.c (1)
34-53: LGTM! The GELU gradient computation is mathematically correct. The function properly implements the derivative of the GELU approximation (x * sigmoid(1.702*x)) using the chain rule and applies the upstream gradient correctly.
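For readers checking the math, the chain-rule expression being described is the standard derivative of the sigmoid-based approximation (with sigma the logistic sigmoid and k = 1.702); this is a reference derivation, not code quoted from the kernel:

```latex
% GELU(x) \approx x\,\sigma(kx) with k = 1.702
\frac{d}{dx}\bigl[x\,\sigma(kx)\bigr]
  = \sigma(kx) + k\,x\,\sigma(kx)\bigl(1 - \sigma(kx)\bigr),
\qquad
\text{grad\_out} = \text{grad\_in}\cdot\frac{d}{dx}\bigl[x\,\sigma(kx)\bigr].
```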
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)
12-20: LGTM! The gradient template correctly implements parallel chunking across cores, with proper bounds checking and invocation of the GELU gradient kernel. The chunk size calculation and boundary handling prevent out-of-bounds access.
Deeploy/Targets/Generic/Layers.py (2)
61-71: LGTM! The GELUGradLayer is properly implemented with a reasonable operation count estimate of 9 operations per element for the GELU gradient computation.
482-486: LGTM! The computeOps() method correctly estimates 2 operations per element (one multiply and one add/subtract) for the SGD weight update step.
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)
83-159: Verify tiling schedule correctness for gradient flow. The LayernormGradTileConstraint implementation looks correct overall. However, ensure that the tiling schedule properly handles the gradient data flow:
- Line 151: Both grad_in and data_in are loaded with the same cube dimensions.
- Line 153: grad_out uses the same cube dimensions as the inputs.
This is consistent with the forward pass pattern and the gradient computation requirements.
Consider verifying with a test that exercises tiling for LayerNorm gradient to ensure the tile boundaries and data dependencies are correctly handled.
Deeploy/Targets/Generic/Parsers.py (2)
773-798: LGTM! The GELUGradParser correctly parses GELU gradient nodes with proper input/output mapping. The parser validates 2 inputs (upstream gradient and forward input) and 1 output, which aligns with the gradient computation requirements.
1677-1705: LGTM! The LayerNormGradParser correctly extends iLayerNormParser and validates 4 inputs (grad_in, data_in, weight, bias) and 1 output for the LayerNorm gradient operation. The size and lastDimLength computations are correct.
Deeploy/Targets/PULPOpen/Tiler.py (4)
18-25: LGTM! The import updates correctly add gradient-specific bindings (GELUGrad, LayerNormGrad, SoftmaxGrad, SGD) needed for the new tiling ready bindings defined later in the file.
30-35: LGTM! The tile constraint imports are properly added to support gradient operations (GeluGradTileConstraint, LayernormGradTileConstraint, SoftmaxGradTileConstraint, ReduceSumTileConstraint), enabling gradient-aware tiling throughout the pipeline.
124-125: LGTM! The new gradient tiling ready bindings are correctly configured:
- PULPLayernormGradTilingReadyBindings uses LayernormGradTileConstraint
- PULPFPGELUGradTilingReadyBindings uses GeluGradTileConstraint
Both follow the established pattern and properly wire gradient constraints into the tiling framework.
Also applies to: 130-131
143-146: LGTM! The updates to PULPSoftmaxGradTilingReadyBindings and PULPReduceSumTilingReadyBindings replace the implicit UntiledTileConstraint with explicit gradient-specific constraints (SoftmaxGradTileConstraint and ReduceSumTileConstraint), improving tiling optimization for these operations.
TargetLibraries/PULPOpen/src/Gemm.c (5)
29-31: LGTM: Bias handling and unroll factor initialization. The has_bias flag via NULL check and unroll factor computation are correctly implemented. The modulo-based tail calculation ensures correct handling of non-divisible dimensions.
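A minimal sketch of the pattern being reviewed (a NULL bias pointer selects the bias-free path, and the leftover-column count comes from a modulo). The names, signature, and row-major layout here are illustrative assumptions, not the actual PULP kernel:

```c
#include <stddef.h>
#include <stdint.h>

/* Row-major Y = A*B (+ bias), with A: M x N, B: N x O, Y: M x O. */
static void gemm_bias_optional(const float *A, const float *B, const float *bias,
                               float *Y, int32_t M, int32_t N, int32_t O) {
    const int32_t has_bias = (bias != NULL); /* optional bias selected via NULL check */
    for (int32_t i = 0; i < M; ++i) {
        for (int32_t j = 0; j < O; ++j) {
            /* add the bias term only when a bias buffer was provided */
            float acc = has_bias ? bias[j] : 0.0f;
            for (int32_t k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * O + j];
            Y[i * O + j] = acc;
        }
    }
    /* The reviewed kernel additionally unrolls the output loop by 6 and handles
     * the remaining O % 6 elements in a scalar tail, as described above. */
}
```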
33-83: LGTM: Non-transposed GEMM path with O-dimension unrolling. The index calculations are correct for row-major layout. The 6-wide O-dimension unrolling with proper tail handling is well-implemented.
84-178: LGTM: Transposed-A GEMM path with dual-axis unrolling. The transpose indexing pSrcA[k*M + i] correctly accesses column i of the original matrix. Both N and O dimension unrolling with tail handling are correctly implemented.
179-264: LGTM: Transposed-B GEMM path. The b_row pointer setup for transposed B access is correct. The dual-axis unrolling efficiently processes 6 output columns while iterating over the reduction dimension.
265-353: LGTM: Both-transposed GEMM path. The combination of transpose indexing for both matrices is correctly implemented, with consistent unrolling strategy across all code paths.
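As a compact reference for the four paths above, the row-major access patterns can be summarized with two small helpers (illustrative names, assuming A is logically M x N and B is N x O; this is a sketch, not the kernel's code):

```c
/* When a transpose flag is set, the operand is stored with its axes swapped,
 * so the linear index changes accordingly. */
static inline float elem_a(const float *A, int i, int k, int M, int N, int transA) {
    return transA ? A[k * M + i]   /* A stored as N x M */
                  : A[i * N + k];  /* A stored as M x N */
}

static inline float elem_b(const float *B, int k, int j, int N, int O, int transB) {
    return transB ? B[j * N + k]   /* B stored as O x N */
                  : B[k * O + j];  /* B stored as N x O */
}
```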
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
72-90: Verify LayerNorm gradient formula correctness. The implementation computes sum_dy as the unscaled sum of gradients (line 73), but the standard LayerNorm backward pass formula typically requires sum(dy * scale) for the mean subtraction term. The current formula:
grad_in[j] - (sum_dy / lastDimLength)
Should potentially be:
grad_in[j] * scale[j] - (sum_dy_scaled / lastDimLength)
where sum_dy_scaled = sum(dy * scale). The declared but unused sum_dy_scaled variable (line 45) suggests this may have been the original intent.
Please verify against the PyTorch or reference LayerNorm backward implementation to confirm the gradient computation is correct.
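For reference, the standard per-row LayerNorm backward formula being alluded to (one row of length D, with x-hat the normalized input, sigma = sqrt(var + epsilon), gamma the scale, and dy the upstream gradient) is the textbook derivation, not a quote from this kernel:

```latex
g_j = \gamma_j\, dy_j, \qquad
\frac{\partial L}{\partial x_j}
  = \frac{1}{\sigma}\left(
      g_j
      - \frac{1}{D}\sum_{k=1}^{D} g_k
      - \hat{x}_j\,\frac{1}{D}\sum_{k=1}^{D} g_k\,\hat{x}_k
    \right)
```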
Deeploy/Targets/PULPOpen/Bindings.py (3)
423-429: LGTM: LayerNorm gradient binding. The binding correctly specifies 4 float32 inputs (grad_in, data_in, weight, bias) and 1 output, matching the LayerNormGradParser signature. The use of ForkTransformer is consistent with other PULP bindings.
435-437: LGTM: GELU gradient binding. The gradient binding mirrors the forward PULPFloatGELUBinding structure with 2 inputs and 1 output, appropriately using referenceGradTemplate.
191-196: I'll help you verify the 3-input Add template compatibility concern. Let me start by locating the FloatAddTemplate definition and examining its implementation.
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (2)
19-25: LGTM: Optional bias handling in alignToContext. The method correctly handles the no-bias case by setting sensible defaults. The C_type default to float32_tPtr ensures the generated function signature remains valid even when C is NULL.
28-79: LGTM: GEMM template with conditional bias handling. The template correctly handles both biased and bias-free GEMM cases with appropriate conditional code generation. The batch pointer arithmetic is correctly guarded by both C is not None and C_batched conditions.
Deeploy/Targets/PULPOpen/Platform.py (3)
59-59: LGTM: GELUGradMapper definition. The mapper correctly wires GELUGradParser to PULPFPGELUGradTilingReadyBindings, following the established pattern of other mappers in this file.
89-89: LGTM: LayerNormGradMapper definition. The mapper correctly connects LayerNormGradParser to PULPLayernormGradTilingReadyBindings.
118-120: LGTM: Gradient operation mapping entries. The GeluGrad and LayerNormalizationGrad entries are correctly added to PULPMapping, following the same pattern as existing gradient operations like SoftmaxGrad.
Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1)
106-188: LGTM! The SoftmaxGradTileConstraint implementation follows the same well-established pattern as iSoftmaxTileConstraint. The three-tensor constraint logic in addGeometricalConstraint correctly ensures dimensional consistency across gradient inputs and outputs, and the serialization properly maps the input/output load schedules.
Regarding the static analysis hint about ctxt being unused on line 159: this parameter is part of the TileConstraint interface signature and must be maintained for API consistency across all constraint implementations.
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (3)
126-130: LGTM! The empty implementation with explanatory comment is acceptable. The parent class interface requires this method signature, and deferring policy decisions to serializeTilingSolution is a valid design choice for ReduceSum operations.
132-168: LGTM! The method correctly constructs the symbolic representation with normalized axes and keepdims handling. The unused tilerModel parameter is an interface requirement from the parent TileConstraint class.
170-175: I'll help you verify this review comment. Let me start by examining how absoluteOutputCubes is used across different tile constraint classes and then look at the specific implementation in ReduceSumTileConstraint.
const float* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
const float* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
float* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
Type inconsistency: hardcoded float* vs templated kernel type.
The pointer declarations use hardcoded const float* and float*, but the kernel call uses ${grad_in_type.referencedType.typeWidth} for type selection. If this template is used with non-fp32 types, there will be a type mismatch.
Consider using the template type for consistency:
-const float* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
-const float* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
-float* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
+const float${grad_in_type.referencedType.typeWidth}_t* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
+const float${data_in_type.referencedType.typeWidth}_t* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
+float${grad_out_type.referencedType.typeWidth}_t* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
I think this is a valid comment to address. We should always use the data types with the explicit bitwidth.
@Xeratec, thank you for confirming! You're absolutely right – using explicit bitwidth types ensures type safety and consistency across the template.
Here's the corrected code for lines 36-38:
const float${grad_in_type.referencedType.typeWidth}_t* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
const float${data_in_type.referencedType.typeWidth}_t* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
float${grad_out_type.referencedType.typeWidth}_t* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
This ensures the pointer types match the templated kernel types (e.g., float32_t*, float16_t*) rather than being hardcoded to float*, maintaining consistency throughout the template instantiation.
✏️ Learnings added
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 69
File: Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py:36-38
Timestamp: 2025-12-02T13:54:22.680Z
Learning: In Deeploy templates (Python files in Deeploy/Targets/PULPOpen/Templates/), always use explicit bitwidth types (e.g., `float${...type.referencedType.typeWidth}_t*`) instead of hardcoded types (e.g., `float*`) to ensure type consistency with templated kernel calls.
int16_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
int16_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
int16_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
Potential overflow: int16_t may be insufficient for large tensors.
chunk, chunk_start, and chunk_stop are declared as int16_t, limiting ${size} to a maximum of 32,767 elements. For larger tensors, these variables will overflow, causing incorrect chunking and potential out-of-bounds access. Consider using int32_t or uint32_t for these variables.
-int16_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
-int16_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
-int16_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
+int32_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
+int32_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
+int32_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
I think this is also a valid comment.
@Xeratec, thanks for confirming! You're right—this is an important fix. Using int32_t instead of int16_t for the chunk variables will prevent overflow issues when training with larger tensors (anything over 32K elements), ensuring the parallel work distribution remains correct across all cores.
varA = operatorRepresentation['A']
varB = operatorRepresentation['B']

if transA == 0:
-   NSize = buffA.shape[-1]
+   NSize = ctxt.lookup(varA).shape[-1]
else:
-   NSize = buffA.shape[-2]
+   NSize = ctxt.lookup(varA).shape[-2]
🛠️ Refactor suggestion | 🟠 Major
Remove unused varB assignment.
varB is assigned on line 290 but never used, as confirmed by static analysis. Remove this dead code.
varA = operatorRepresentation['A']
- varB = operatorRepresentation['B']
if transA == 0:
    NSize = ctxt.lookup(varA).shape[-1]
else:
    NSize = ctxt.lookup(varA).shape[-2]
🧰 Tools
🪛 Ruff (0.14.5)
290-290: Local variable varB is assigned to but never used
Remove assignment to unused variable varB
(F841)
Valid comment.
@Xeratec, thank you for confirming! Feel free to apply the suggestion when you're ready to remove the unused varB assignment.
🧠 Learnings used
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.
Xeratec left a comment:
Thanks for your work!
Looks already very good. I still have some small comments, but nothing major.
def computeOps(self):
    size = self.mapper.parser.operatorRepresentation['size']
    return size * 2
I would not consider this as actual operations. I would argue that to compute the Op/s, we should only consider arithmetic operations and transpositions, thus have zero.
] + [
    NodeBinding(
        AddChecker([PointerClass(float32_t), PointerClass(float32_t),
                    PointerClass(float32_t)], [PointerClass(float32_t)]), FloatAddTemplate.referenceTemplate,
        ForkTransformer)
Is this 3-input adder tested and actually supported by the FloatAddTemplate?
@diaconuccalin Would you mind reviewing this file, as you are now very familiar with the GEMMTileConstraint.py and were the last person to change it?
@runwangdl I do not see a reason to remove the comments added by Calin. I believe it makes the code easier to read.
This PR introduces gradient operator support, improved GEMM performance, and updates to the CCT training workflow. It also includes fixes to tile constraints and naming consistency in the transpose pass.
Added
Changed
Fixed
PR Merge Checklist
- The PR is based on the latest devel commit and pointing to devel.
- The CHANGELOG.md file has been updated.