CCT Attention Training on Siracusa #69
base: devel
Conversation
📝 Walkthrough
This PR adds support for GELU and LayerNorm gradient operations throughout the Deeploy compiler stack, including parsers, layers, tile constraints, templates, and kernel implementations. It also includes optimizations to GEMM (optional bias handling) and SGD (loop unrolling), updates CI configurations for Siracusa platforms, and refines transpose node naming in topology optimization passes.
Sequence Diagram(s)
sequenceDiagram
participant Compiler as Deeploy Compiler
participant Parser as GELUGrad/LayerNormGrad<br/>Parser
participant Layer as Gradient Layer
participant Mapper as Tiling Mapper
participant TileConstraint as Tile Constraint
participant Template as Gradient Template
participant Kernel as Target Kernel
Compiler->>Parser: Parse gradient node (GELUGrad/LayerNormGrad)
Parser->>Parser: Extract inputs (grad_in, data_in, etc.)
Parser->>Layer: Create Layer instance
Layer->>Mapper: Apply mapper with parsed operatorRepresentation
Mapper->>TileConstraint: Initialize tile constraint
TileConstraint->>TileConstraint: Add geometrical/policy constraints
TileConstraint->>Template: Serialize tiling solution
Template->>Template: Generate parallel chunk-based code
Template->>Kernel: Invoke gradient kernel (chunk range)
Kernel->>Kernel: Compute per-element gradient
Kernel->>Compiler: Return grad_out
sequenceDiagram
participant Old as Old SGD/GEMM
participant Codegen as Code Generator
participant Unroll as Unrolled Implementation
participant Kernel as Optimized Kernel
Old->>Codegen: Simple per-element loop
Codegen->>Codegen: Single operation per iteration
Codegen->>Kernel: Generate baseline code
alt New SGD/GEMM Path
Codegen->>Unroll: Extract chunks (6-wide blocks)
Unroll->>Unroll: Batch multiply (6 elements)
Unroll->>Unroll: Batch update weights/accumulate
Unroll->>Kernel: Process tail elements
Kernel->>Codegen: Optimized execution
end
alt GEMM Optional Bias
Codegen->>Unroll: Check has_bias flag
alt Bias present
Unroll->>Kernel: Add per-row bias
else No bias
Unroll->>Kernel: Pass NULL for C
end
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 3
🧹 Nitpick comments (8)
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1)
26-42: Unrolling factor of 6 is functional but unconventional. The loop unrolling works correctly. However, an unroll factor of 4 or 8 typically aligns better with SIMD widths and cache lines on most architectures. If performance is critical, consider benchmarking with power-of-2 unroll factors.
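For comparison, here is a minimal, self-contained sketch of the same update pattern with a power-of-2 unroll factor and an explicit tail loop. The function name and signature are illustrative assumptions, not code taken from the Deeploy/PULP sources:

```c
#include <stdint.h>

/* Hypothetical SGD step: w[i] -= lr * g[i], unrolled by 4 with a scalar tail.
 * This only illustrates the unroll/tail pattern discussed above. */
static void sgd_step_unroll4(float *w, const float *g, float lr, int32_t size) {
    int32_t i = 0;
    const int32_t main_end = size - (size % 4); /* last index covered by 4-wide blocks */
    for (; i < main_end; i += 4) {
        w[i]     -= lr * g[i];
        w[i + 1] -= lr * g[i + 1];
        w[i + 2] -= lr * g[i + 2];
        w[i + 3] -= lr * g[i + 3];
    }
    for (; i < size; ++i) { /* tail elements (size % 4 of them) */
        w[i] -= lr * g[i];
    }
}
```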
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)
355-360: Consider adding strict=True to zip() calls for robustness. While the lists are constructed together and should always match in length, adding strict=True (Python 3.10+) would catch any future bugs where list lengths diverge. This is optional given the current structure guarantees equal lengths.
 if has_bias:
-    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes):
+    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes, strict=True):
         inputLoadSchedule.append({"A": a, "B": b, "C": c})
 else:
-    for a, b in zip(inputACubes, inputBCubes):
+    for a, b in zip(inputACubes, inputBCubes, strict=True):
         inputLoadSchedule.append({"A": a, "B": b})
Deeploy/Targets/Generic/Layers.py (1)
453-457: Consider adding a computeOps() method for consistency. The LayerNormGradLayer class doesn't implement computeOps(), while other gradient layers like GELUGradLayer (line 66) and SoftmaxGradLayer (line 134) do. For consistency and accurate operation counting in analysis workflows, consider adding this method.
Would you like me to help estimate the operation count for LayerNorm gradient based on the kernel implementation in TargetLibraries/Generic/src/Layernorm_fp32.c?
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)
113-121: Rename unused loop variables. The loop variable dim is not used in the loop body on lines 113-117 and 118-121. Following Python conventions, rename it to _dim or simply remove it if only the index is needed. Apply this diff:
-    for idx, dim in enumerate(input_shape):
+    for idx, _ in enumerate(input_shape):
         tilerModel.addConstraint(
             tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
             tilerModel.getTensorDimVar(tensorName = grad_in_buffer_name, dimIdx = idx))
-    for idx, dim in enumerate(input_shape):
+    for idx, _ in enumerate(input_shape):
         tilerModel.addConstraint(
             tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
             tilerModel.getTensorDimVar(tensorName = grad_out_buffer_name, dimIdx = idx))
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
40-43: Unused bias parameter. The bias parameter is declared but never used in the function body. LayerNorm gradient with respect to input typically doesn't require the bias term, but the signature inconsistency with the forward pass may cause confusion. Consider either:
- Documenting why bias is intentionally unused (for API consistency with the forward pass), or
- Removing it if not needed for gradient computation:
-void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
-                             float32_t *grad_out, float32_t *scale,
-                             float32_t *bias, float32_t epsilon, int32_t size,
-                             int32_t lastDimLength) {
+void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
+                             float32_t *grad_out, float32_t *scale,
+                             float32_t epsilon, int32_t size,
+                             int32_t lastDimLength) {
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (1)
16-17: Return type annotation inconsistency. The method signature declares Tuple[NetworkContext, Dict, List[str]] but should return Tuple[NetworkContext, OperatorRepresentation, List[str]] to match the base class NodeTemplate.alignToContext signature shown in the relevant snippets.
 def alignToContext(self, ctxt: NetworkContext,
-                   operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, Dict, List[str]]:
+                   operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]:
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (2)
49-59: Consider extracting axis normalization into a helper method. The negative axis normalization logic is duplicated across addGeometricalConstraint (lines 54-59 and 80-86), constructSymbolicNodeRep (lines 154-158), and serializeTilingSolution (lines 210-216). Extracting this into a shared static helper would reduce duplication and ensure consistent behavior.
    @staticmethod
    def _normalize_axes(axis, input_shape_len: int) -> List[int]:
        """Normalize axis indices, handling negative values."""
        if isinstance(axis, int):
            axis = [axis]
        normalized = []
        for ax in axis:
            if ax < 0:
                ax = input_shape_len + ax
            normalized.append(ax)
        return normalized
Also applies to: 75-86, 146-158, 210-216
237-238: Move import to top of file. The HyperRectangle import should be placed at the top of the file with other imports from the same module, unless there's a specific reason to avoid it (e.g., circular dependency).
 from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import HyperRectangle
Or combine into a single import:
-from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, HyperRectangle, TilingSchedule, VariableReplacementScheme
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- .github/workflows/ci-platform-siracusa-tiled.yml (3 hunks)
- .github/workflows/ci-platform-siracusa.yml (0 hunks)
- Deeploy/Targets/Generic/Layers.py (3 hunks)
- Deeploy/Targets/Generic/Parsers.py (2 hunks)
- Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1 hunks)
- Deeploy/Targets/PULPOpen/Bindings.py (2 hunks)
- Deeploy/Targets/PULPOpen/Platform.py (5 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (5 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/Tiler.py (3 hunks)
- Deeploy/TilingExtension/TilingCodegen.py (1 hunks)
- TargetLibraries/Generic/inc/kernel/GELU.h (1 hunks)
- TargetLibraries/Generic/inc/kernel/Layernorm.h (1 hunks)
- TargetLibraries/Generic/src/GELU_fp32.c (1 hunks)
- TargetLibraries/Generic/src/Layernorm_fp32.c (1 hunks)
- TargetLibraries/PULPOpen/src/Gemm.c (1 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/ci-platform-siracusa.yml
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
📚 Learning: 2025-09-09T15:43:20.195Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.
Applied to files:
- Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py
- Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
🧬 Code graph analysis (10)
TargetLibraries/Generic/inc/kernel/GELU.h (1)
TargetLibraries/Generic/src/GELU_fp32.c (1)
GELU_fp32_fp32_sigmoid_grad_chunk(34-53)
TargetLibraries/Generic/inc/kernel/Layernorm.h (1)
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
LayernormGrad_fp32_fp32(40-93)
Deeploy/Targets/Generic/Layers.py (1)
Deeploy/DeeployTypes.py (2)
ONNXLayer (1819-2147)
computeOps (1835-1840)
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (4)
Deeploy/DeeployTypes.py (1)
lookup (720-752)
Deeploy/TilingExtension/TilerModel.py (1)
getTensorDimVar (131-135)
Deeploy/TilingExtension/TilingCodegen.py (1)
HyperRectangle (24-35)
Deeploy/TilingExtension/TileConstraint.py (1)
extractBaseAddr(56-74)
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/Bindings.py (4)
Deeploy/DeeployTypes.py (1)
NodeBinding (1512-1657)
Deeploy/Targets/Generic/TypeCheckers.py (3)
AddChecker (88-102)
LayerNormChecker (198-209)
GELUChecker (388-402)
Deeploy/AbstractDataTypes.py (1)
PointerClass (536-559)
Deeploy/CommonExtensions/DataTypes.py (1)
float32_t(74-78)
Deeploy/Targets/PULPOpen/Tiler.py (4)
Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)
GeluGradTileConstraint (8-12)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (2)
LayernormTileConstraint (19-80)
LayernormGradTileConstraint (83-159)
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1)
ReduceSumTileConstraint (18-250)
Deeploy/TilingExtension/TilerExtension.py (1)
TilingReadyNodeBindings(1027-1035)
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3)
Deeploy/DeeployTypes.py (3)
NetworkContext (508-1020)
NodeTemplate (87-229)
alignToContext (119-139)
Deeploy/Targets/PULPOpen/Templates/ReshapeTemplate.py (1)
alignToContext (37-53)
Deeploy/Targets/PULPOpen/Templates/DMASliceTemplate.py (1)
alignToContext(18-138)
Deeploy/Targets/PULPOpen/Platform.py (3)
Deeploy/Targets/Generic/Layers.py (2)
GELUGradLayer (61-70)
LayerNormGradLayer (453-456)
Deeploy/Targets/Generic/Parsers.py (3)
GELUGradParser (773-797)
LayerNormGradParser (1677-1704)
LayerNormParser (1647-1674)
Deeploy/DeeployTypes.py (1)
NodeMapper(1660-1816)
🪛 Ruff (0.14.5)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py
113-113: Loop control variable dim not used within loop body
Rename unused dim to _dim
(B007)
118-118: Loop control variable dim not used within loop body
Rename unused dim to _dim
(B007)
128-128: Unused class method argument: ctxt
(ARG003)
Deeploy/Targets/Generic/Parsers.py
786-786: Unused method argument: channels_first
(ARG002)
1691-1691: Unused method argument: channels_first
(ARG002)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
290-290: Local variable varB is assigned to but never used
Remove assignment to unused variable varB
(F841)
356-356: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
359-359: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py
71-71: Avoid specifying long messages outside the exception class
(TRY003)
92-93: Avoid specifying long messages outside the exception class
(TRY003)
127-127: Unused static method argument: parseDict
(ARG004)
127-127: Unused static method argument: ctxt
(ARG004)
133-133: Unused static method argument: tilerModel
(ARG004)
172-172: Unused class method argument: absoluteOutputCubes
(ARG003)
Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py
159-159: Unused class method argument: ctxt
(ARG003)
🔇 Additional comments (39)
.github/workflows/ci-platform-siracusa-tiled.yml (3)
137-138: Verify intentionality of L1 configuration simplification for CCT tests. The change reduces L1 memory configurations from multiple test values to a single value, which limits test coverage to one memory allocation strategy. According to the AI summary, this was [4000, 64000] and is now [64000]. Confirm this reduction is intentional and doesn't mask hardware-specific issues that only surface with smaller memory allocations.
168-171: Verify consistency of new test addition across all job configurations. A new test entry testTrainCCT/CCT2_FT2 is added to the L3 jobs with L1: [128000]. Per the AI summary, this test should also be added to the singlebuffer-L2 configuration (line 138 area). Confirm whether this is intentional or if the test entry is missing from singlebuffer-L2 and other jobs.
208-211: Verify consistency of L1 simplification across doublebuffer-L3. Changes here mirror those in singlebuffer-L3 (lines 169-171), which is good for consistency. However, apply the same verification from the previous comment to ensure testTrainCCT/CCT2_FT2 is correctly scoped to only L3 jobs and not missing from other configurations.
Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1)
679-680: LGTM! Naming collision fix improves correctness. The updated naming scheme correctly prevents collisions when multiple transpose nodes feed into the same (or identically-named) downstream nodes. By prefixing with t1.name, each split transpose gets a unique identifier derived from both the source transpose and its consumer.
Deeploy/TilingExtension/TilingCodegen.py (1)
34-35: LGTM! Defensive tuple coercion improves robustness. The normalization ensures offset and dims are always tuples internally, which is useful when callers pass lists or other sequences. Consider updating the type hints from Tuple[int, ...] to Sequence[int] to reflect the accepted input types more accurately.
TargetLibraries/Generic/inc/kernel/GELU.h (1)
28-30: LGTM! Function declaration is consistent with implementation. The new gradient chunk function declaration aligns with the implementation in GELU_fp32.c and follows the existing naming conventions in this header.
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)
200-210: LGTM! Optional bias handling is correctly implemented. The conditional bias detection and buffer handling ensures both biased and unbiased GEMM configurations are properly supported. The pattern is consistently applied in both addGeometricalConstraint and serializeTilingSolution.
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
40-51: LGTM! Kernel call with proper guard condition. The check for elem_count > 0 correctly prevents calling the kernel with zero elements, which could occur for cores that have no work assigned when the sequence count is less than the core count.
Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)
8-12: LGTM! The GeluGradTileConstraint class correctly extends BOPTileConstraint with appropriate gradient-specific tensor names (grad_in, data_in, grad_out), following the established pattern for binary operation tile constraints.
TargetLibraries/Generic/inc/kernel/Layernorm.h (1)
28-31: LGTM! The function declaration for LayernormGrad_fp32_fp32 is correctly added, with a signature that mirrors the forward pass and properly includes gradient-specific parameters (grad_in, grad_out). The implementation exists in the corresponding .c file.
TargetLibraries/Generic/src/GELU_fp32.c (1)
34-53: LGTM! The GELU gradient computation is mathematically correct. The function properly implements the derivative of the GELU approximation (x * sigmoid(1.702*x)) using the chain rule and applies the upstream gradient correctly.
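For readers checking the math, the chain-rule expression being described is the standard derivative of the sigmoid-based approximation (with sigma the logistic sigmoid and k = 1.702); this is a reference derivation, not code quoted from the kernel:

```latex
% GELU(x) \approx x\,\sigma(kx) with k = 1.702
\frac{d}{dx}\bigl[x\,\sigma(kx)\bigr]
  = \sigma(kx) + k\,x\,\sigma(kx)\bigl(1 - \sigma(kx)\bigr),
\qquad
\text{grad\_out} = \text{grad\_in}\cdot\frac{d}{dx}\bigl[x\,\sigma(kx)\bigr].
```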
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)
12-20: LGTM! The gradient template correctly implements parallel chunking across cores, with proper bounds checking and invocation of the GELU gradient kernel. The chunk size calculation and boundary handling prevent out-of-bounds access.
Deeploy/Targets/Generic/Layers.py (2)
61-71: LGTM! The GELUGradLayer is properly implemented with a reasonable operation count estimate of 9 operations per element for the GELU gradient computation.
482-486: LGTM! The computeOps() method correctly estimates 2 operations per element (one multiply and one add/subtract) for the SGD weight update step.
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)
83-159: Verify tiling schedule correctness for gradient flow. The LayernormGradTileConstraint implementation looks correct overall. However, ensure that the tiling schedule properly handles the gradient data flow:
- Line 151: Both grad_in and data_in are loaded with the same cube dimensions.
- Line 153: grad_out uses the same cube dimensions as the inputs.
This is consistent with the forward pass pattern and the gradient computation requirements.
Consider verifying with a test that exercises tiling for LayerNorm gradient to ensure the tile boundaries and data dependencies are correctly handled.
Deeploy/Targets/Generic/Parsers.py (2)
773-798: LGTM! The GELUGradParser correctly parses GELU gradient nodes with proper input/output mapping. The parser validates 2 inputs (upstream gradient and forward input) and 1 output, which aligns with the gradient computation requirements.
1677-1705: LGTM! The LayerNormGradParser correctly extends iLayerNormParser and validates 4 inputs (grad_in, data_in, weight, bias) and 1 output for the LayerNorm gradient operation. The size and lastDimLength computations are correct.
Deeploy/Targets/PULPOpen/Tiler.py (4)
18-25: LGTM! The import updates correctly add gradient-specific bindings (GELUGrad, LayerNormGrad, SoftmaxGrad, SGD) needed for the new tiling ready bindings defined later in the file.
30-35: LGTM! The tile constraint imports are properly added to support gradient operations (GeluGradTileConstraint, LayernormGradTileConstraint, SoftmaxGradTileConstraint, ReduceSumTileConstraint), enabling gradient-aware tiling throughout the pipeline.
124-125: LGTM! The new gradient tiling ready bindings are correctly configured:
- PULPLayernormGradTilingReadyBindings uses LayernormGradTileConstraint
- PULPFPGELUGradTilingReadyBindings uses GeluGradTileConstraint
Both follow the established pattern and properly wire gradient constraints into the tiling framework.
Also applies to: 130-131
143-146: LGTM! The updates to PULPSoftmaxGradTilingReadyBindings and PULPReduceSumTilingReadyBindings replace the implicit UntiledTileConstraint with explicit gradient-specific constraints (SoftmaxGradTileConstraint and ReduceSumTileConstraint), improving tiling optimization for these operations.
TargetLibraries/PULPOpen/src/Gemm.c (5)
29-31: LGTM: Bias handling and unroll factor initialization. The has_bias flag via NULL check and unroll factor computation are correctly implemented. The modulo-based tail calculation ensures correct handling of non-divisible dimensions.
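A minimal sketch of the pattern being reviewed (a NULL bias pointer selects the bias-free path, and the leftover-column count comes from a modulo). The names, signature, and row-major layout here are illustrative assumptions, not the actual PULP kernel:

```c
#include <stddef.h>
#include <stdint.h>

/* Row-major Y = A*B (+ bias), with A: M x N, B: N x O, Y: M x O. */
static void gemm_bias_optional(const float *A, const float *B, const float *bias,
                               float *Y, int32_t M, int32_t N, int32_t O) {
    const int32_t has_bias = (bias != NULL); /* optional bias selected via NULL check */
    for (int32_t i = 0; i < M; ++i) {
        for (int32_t j = 0; j < O; ++j) {
            /* add the bias term only when a bias buffer was provided */
            float acc = has_bias ? bias[j] : 0.0f;
            for (int32_t k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * O + j];
            Y[i * O + j] = acc;
        }
    }
    /* The reviewed kernel additionally unrolls the output loop by 6 and handles
     * the remaining O % 6 elements in a scalar tail, as described above. */
}
```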
33-83: LGTM: Non-transposed GEMM path with O-dimension unrolling. The index calculations are correct for row-major layout. The 6-wide O-dimension unrolling with proper tail handling is well-implemented.
84-178: LGTM: Transposed-A GEMM path with dual-axis unrolling. The transpose indexing pSrcA[k*M + i] correctly accesses column i of the original matrix. Both N and O dimension unrolling with tail handling are correctly implemented.
179-264: LGTM: Transposed-B GEMM path. The b_row pointer setup for transposed B access is correct. The dual-axis unrolling efficiently processes 6 output columns while iterating over the reduction dimension.
265-353: LGTM: Both-transposed GEMM path. The combination of transpose indexing for both matrices is correctly implemented, with consistent unrolling strategy across all code paths.
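As a compact reference for the four paths above, the row-major access patterns can be summarized with two small helpers (illustrative names, assuming A is logically M x N and B is N x O; this is a sketch, not the kernel's code):

```c
/* When a transpose flag is set, the operand is stored with its axes swapped,
 * so the linear index changes accordingly. */
static inline float elem_a(const float *A, int i, int k, int M, int N, int transA) {
    return transA ? A[k * M + i]   /* A stored as N x M */
                  : A[i * N + k];  /* A stored as M x N */
}

static inline float elem_b(const float *B, int k, int j, int N, int O, int transB) {
    return transB ? B[j * N + k]   /* B stored as O x N */
                  : B[k * O + j];  /* B stored as N x O */
}
```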
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
72-90: Verify LayerNorm gradient formula correctness. The implementation computes sum_dy as the unscaled sum of gradients (line 73), but the standard LayerNorm backward pass formula typically requires sum(dy * scale) for the mean subtraction term. The current formula:
grad_in[j] - (sum_dy / lastDimLength)
Should potentially be:
grad_in[j] * scale[j] - (sum_dy_scaled / lastDimLength)
where sum_dy_scaled = sum(dy * scale). The declared but unused sum_dy_scaled variable (line 45) suggests this may have been the original intent.
Please verify against the PyTorch or reference LayerNorm backward implementation to confirm the gradient computation is correct.
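For reference, the standard per-row LayerNorm backward formula being alluded to (one row of length D, with x-hat the normalized input, sigma = sqrt(var + epsilon), gamma the scale, and dy the upstream gradient) is the textbook derivation, not a quote from this kernel:

```latex
g_j = \gamma_j\, dy_j, \qquad
\frac{\partial L}{\partial x_j}
  = \frac{1}{\sigma}\left(
      g_j
      - \frac{1}{D}\sum_{k=1}^{D} g_k
      - \hat{x}_j\,\frac{1}{D}\sum_{k=1}^{D} g_k\,\hat{x}_k
    \right)
```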
Deeploy/Targets/PULPOpen/Bindings.py (3)
423-429: LGTM: LayerNorm gradient binding. The binding correctly specifies 4 float32 inputs (grad_in, data_in, weight, bias) and 1 output, matching the LayerNormGradParser signature. The use of ForkTransformer is consistent with other PULP bindings.
435-437: LGTM: GELU gradient binding. The gradient binding mirrors the forward PULPFloatGELUBinding structure with 2 inputs and 1 output, appropriately using referenceGradTemplate.
191-196: I'll help you verify the 3-input Add template compatibility concern. Let me start by locating the FloatAddTemplate definition and examining its implementation.
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (2)
19-25: LGTM: Optional bias handling in alignToContext. The method correctly handles the no-bias case by setting sensible defaults. The C_type default to float32_tPtr ensures the generated function signature remains valid even when C is NULL.
28-79: LGTM: GEMM template with conditional bias handling. The template correctly handles both biased and bias-free GEMM cases with appropriate conditional code generation. The batch pointer arithmetic is correctly guarded by both C is not None and C_batched conditions.
Deeploy/Targets/PULPOpen/Platform.py (3)
59-59: LGTM: GELUGradMapper definition. The mapper correctly wires GELUGradParser to PULPFPGELUGradTilingReadyBindings, following the established pattern of other mappers in this file.
89-89: LGTM: LayerNormGradMapper definition. The mapper correctly connects LayerNormGradParser to PULPLayernormGradTilingReadyBindings.
118-120: LGTM: Gradient operation mapping entries. The GeluGrad and LayerNormalizationGrad entries are correctly added to PULPMapping, following the same pattern as existing gradient operations like SoftmaxGrad.
Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1)
106-188: LGTM! The SoftmaxGradTileConstraint implementation follows the same well-established pattern as iSoftmaxTileConstraint. The three-tensor constraint logic in addGeometricalConstraint correctly ensures dimensional consistency across gradient inputs and outputs, and the serialization properly maps the input/output load schedules.
Regarding the static analysis hint about ctxt being unused on line 159: this parameter is part of the TileConstraint interface signature and must be maintained for API consistency across all constraint implementations.
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (3)
126-130: LGTM! The empty implementation with explanatory comment is acceptable. The parent class interface requires this method signature, and deferring policy decisions to serializeTilingSolution is a valid design choice for ReduceSum operations.
132-168: LGTM! The method correctly constructs the symbolic representation with normalized axes and keepdims handling. The unused tilerModel parameter is an interface requirement from the parent TileConstraint class.
170-175: I'll help you verify this review comment. Let me start by examining how absoluteOutputCubes is used across different tile constraint classes and then look at the specific implementation in ReduceSumTileConstraint.
const float* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
const float* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
float* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
Type inconsistency: hardcoded float* vs templated kernel type.
The pointer declarations use hardcoded const float* and float*, but the kernel call uses ${grad_in_type.referencedType.typeWidth} for type selection. If this template is used with non-fp32 types, there will be a type mismatch.
Consider using the template type for consistency:
-const float* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
-const float* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
-float* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
+const float${grad_in_type.referencedType.typeWidth}_t* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
+const float${data_in_type.referencedType.typeWidth}_t* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
+float${grad_out_type.referencedType.typeWidth}_t* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
I think this is a valid comment to address. We should always use the data types with the explicit bitwidth.
@Xeratec, thank you for confirming! You're absolutely right – using explicit bitwidth types ensures type safety and consistency across the template.
Here's the corrected code for lines 36-38:
const float${grad_in_type.referencedType.typeWidth}_t* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
const float${data_in_type.referencedType.typeWidth}_t* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
float${grad_out_type.referencedType.typeWidth}_t* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
This ensures the pointer types match the templated kernel types (e.g., float32_t*, float16_t*) rather than being hardcoded to float*, maintaining consistency throughout the template instantiation.
✏️ Learnings added
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 69
File: Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py:36-38
Timestamp: 2025-12-02T13:54:22.680Z
Learning: In Deeploy templates (Python files in Deeploy/Targets/PULPOpen/Templates/), always use explicit bitwidth types (e.g., `float${...type.referencedType.typeWidth}_t*`) instead of hardcoded types (e.g., `float*`) to ensure type consistency with templated kernel calls.
int16_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
int16_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
int16_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
Potential overflow: int16_t may be insufficient for large tensors.
chunk, chunk_start, and chunk_stop are declared as int16_t, limiting ${size} to a maximum of 32,767 elements. For larger tensors, these variables will overflow, causing incorrect chunking and potential out-of-bounds access. Consider using int32_t or uint32_t for these variables.
-int16_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
-int16_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
-int16_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
+int32_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
+int32_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
+int32_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
I think this is also a valid comment.
@Xeratec, thanks for confirming! You're right—this is an important fix. Using int32_t instead of int16_t for the chunk variables will prevent overflow issues when training with larger tensors (anything over 32K elements), ensuring the parallel work distribution remains correct across all cores.
varA = operatorRepresentation['A']
varB = operatorRepresentation['B']

if transA == 0:
-   NSize = buffA.shape[-1]
+   NSize = ctxt.lookup(varA).shape[-1]
else:
-   NSize = buffA.shape[-2]
+   NSize = ctxt.lookup(varA).shape[-2]
🛠️ Refactor suggestion | 🟠 Major
Remove unused varB assignment.
varB is assigned on line 290 but never used, as confirmed by static analysis. Remove this dead code.
varA = operatorRepresentation['A']
- varB = operatorRepresentation['B']
if transA == 0:
    NSize = ctxt.lookup(varA).shape[-1]
else:
    NSize = ctxt.lookup(varA).shape[-2]
🧰 Tools
🪛 Ruff (0.14.5)
290-290: Local variable varB is assigned to but never used
Remove assignment to unused variable varB
(F841)
Valid comment.
@Xeratec, thank you for confirming! Feel free to apply the suggestion when you're ready to remove the unused varB assignment.
🧠 Learnings used
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.
Xeratec left a comment:
Thanks for your work!
Looks already very good. I still have some small comments, but nothing major.
def computeOps(self):
    size = self.mapper.parser.operatorRepresentation['size']
    return size * 2
I would not consider this as actual operations. I would argue that to compute the Op/s, we should only consider arithmetic operations and transpositions, thus have zero.
] + [
    NodeBinding(
        AddChecker([PointerClass(float32_t), PointerClass(float32_t),
                    PointerClass(float32_t)], [PointerClass(float32_t)]), FloatAddTemplate.referenceTemplate,
        ForkTransformer)
Is this 3-input adder tested and actually supported by the FloatAddTemplate?
@diaconuccalin Would you mind reviewing this file, as you are now very familiar with the GEMMTileConstraint.py and were the last person to change it?
@runwangdl I do not see a reason to remove the comments added by Calin. I believe it makes the code easier to read.
This PR introduces gradient operator support, improved GEMM performance, and updates to the CCT training workflow. It also includes fixes to tile constraints and naming consistency in the transpose pass.
Added
Changed
Fixed
PR Merge Checklist
- The PR is based on the latest devel commit and pointing to devel.
- The CHANGELOG.md file has been updated.