
Conversation

Collaborator

@runwangdl runwangdl commented May 11, 2025

This PR introduces gradient operator support, improved GEMM performance, and updates to the CCT training workflow. It also fixes tile constraints and naming consistency in the transpose splitting pass.

Added

  • Implemented LayerNormGrad and GeluGrad operator parsers, bindings, templates, and tile constraints.
  • Added CCT training graphs for linear probing, LoRA, and full backpropagation.

Changed

  • Optimized float GEMM kernel with loop unrolling and improved transpose handling.

Fixed

  • Corrected float GEMM tile constraints and templates for the bias-free case.
  • Fixed the transpose splitting pass logic: resolved repeated node names by appending identifiers derived from the source and destination nodes.

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and points to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

@runwangdl runwangdl changed the title from "[Draft] CCT Attention Training" to "CCT Attention Training on Siracusa" on Nov 26, 2025
@runwangdl runwangdl marked this pull request as ready for review November 26, 2025 21:06
Contributor

coderabbitai bot commented Nov 26, 2025

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added support for gradient operations including GELU, LayerNorm, ReduceSum, and Softmax gradients for enhanced backpropagation capabilities.
  • Performance Improvements

    • Optimized GEMM computation with optional bias handling and unrolled loops for better throughput.
    • Enhanced SGD optimizer with loop unrolling and separated multiplication operations.
  • Bug Fixes

    • Improved handling of optional tensors in tiling constraints and memory management.

✏️ Tip: You can customize this high-level summary in your review settings.

Walkthrough

This PR adds support for GELU and LayerNorm gradient operations throughout the Deeploy compiler stack, including parsers, layers, tile constraints, templates, and kernel implementations. It also includes optimizations to GEMM (optional bias handling) and SGD (loop unrolling), updates CI configurations for Siracusa platforms, and refines transpose node naming in topology optimization passes.

Changes

Cohort / File(s) Summary
CI Configuration Updates
.github/workflows/ci-platform-siracusa-tiled.yml, .github/workflows/ci-platform-siracusa.yml
Modified L1 cache configurations for CCT tests; replaced multi-element L1 arrays with single values; removed testTrainCCT/CCT1_Classifier_Training/CCT_1_16_16_8; added testTrainCCT/CCT2_FT2 entries
Generic Gradient Layer & Parser Infrastructure
Deeploy/Targets/Generic/Layers.py, Deeploy/Targets/Generic/Parsers.py
Added GELUGradLayer and LayerNormGradLayer classes; added computeOps method to SGDLayer; implemented GELUGradParser and LayerNormGradParser with parseNode/parseNodeCtxt methods
Topology Optimization
Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py
Modified transpose node naming in _split_transposes_fun to prefix generated node names with first transpose node name
PULPOpen Bindings & Platform Integration
Deeploy/Targets/PULPOpen/Bindings.py, Deeploy/Targets/PULPOpen/Platform.py
Added PULPLayernormGradBinding, PULPFloatGELUGradBinding, and 3-input float addition pattern; wired GELUGrad/LayerNormGrad mappers and layers into platform mapping
Gradient Computation Templates
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py, Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py
Added referenceGradTemplate for parallel GELU gradient computation with chunk-based processing; added referenceGradTemplate for LayerNorm gradient with thread routing and conditional elem_count evaluation
Optimized Arithmetic Templates
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
Refactored SGD update with loop unrolling (6-wide blocks), chunk-based processing, temporary multiplication buffer, and post-unroll tail handling
GEMM Template & Tiling
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py, Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
Created PULPFloatGEMMTemplate class with alignToContext method; made bias tensor (C) optional with conditional code paths and tiling serialization logic
Gradient Tile Constraints
Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py, Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py, Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py
Added GeluGradTileConstraint, LayernormGradTileConstraint with geometry/policy/serialization methods; added SoftmaxGradTileConstraint with upstream_grad/softmax_output input handling
ReduceSum Tile Constraint
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py
Introduced comprehensive ReduceSumTileConstraint with axis/keepdims handling, symbolic representation, and tiling serialization for reduction operations
Tiling Infrastructure
Deeploy/Targets/PULPOpen/Tiler.py, Deeploy/TilingExtension/TilingCodegen.py
Added PULPLayernormGradTilingReadyBindings and PULPFPGELUGradTilingReadyBindings; replaced UntiledTileConstraint with specific constraints for SoftmaxGrad/ReduceSum; normalized HyperRectangle offset/dims to tuples
Gradient Kernel Headers & Implementations
TargetLibraries/Generic/inc/kernel/GELU.h, TargetLibraries/Generic/src/GELU_fp32.c, TargetLibraries/Generic/inc/kernel/Layernorm.h, TargetLibraries/Generic/src/Layernorm_fp32.c
Added GELU_fp32_fp32_sigmoid_grad_chunk and LayernormGrad_fp32_fp32 function declarations and implementations with gradient computation logic
GEMM Kernel Optimization
TargetLibraries/PULPOpen/src/Gemm.c
Refactored GEMM implementation with bias flag handling, 6-wide loop unrolling, separate code paths for transpose combinations, and blocked accumulations

Sequence Diagram(s)

sequenceDiagram
    participant Compiler as Deeploy Compiler
    participant Parser as GELUGrad/LayerNormGrad<br/>Parser
    participant Layer as Gradient Layer
    participant Mapper as Tiling Mapper
    participant TileConstraint as Tile Constraint
    participant Template as Gradient Template
    participant Kernel as Target Kernel

    Compiler->>Parser: Parse gradient node (GELUGrad/LayerNormGrad)
    Parser->>Parser: Extract inputs (grad_in, data_in, etc.)
    Parser->>Layer: Create Layer instance
    Layer->>Mapper: Apply mapper with parsed operatorRepresentation
    Mapper->>TileConstraint: Initialize tile constraint
    TileConstraint->>TileConstraint: Add geometrical/policy constraints
    TileConstraint->>Template: Serialize tiling solution
    Template->>Template: Generate parallel chunk-based code
    Template->>Kernel: Invoke gradient kernel (chunk range)
    Kernel->>Kernel: Compute per-element gradient
    Kernel->>Compiler: Return grad_out
sequenceDiagram
    participant Old as Old SGD/GEMM
    participant Codegen as Code Generator
    participant Unroll as Unrolled Implementation
    participant Kernel as Optimized Kernel

    Old->>Codegen: Simple per-element loop
    Codegen->>Codegen: Single operation per iteration
    Codegen->>Kernel: Generate baseline code

    alt New SGD/GEMM Path
        Codegen->>Unroll: Extract chunks (6-wide blocks)
        Unroll->>Unroll: Batch multiply (6 elements)
        Unroll->>Unroll: Batch update weights/accumulate
        Unroll->>Kernel: Process tail elements
        Kernel->>Codegen: Optimized execution
    end

    alt GEMM Optional Bias
        Codegen->>Unroll: Check has_bias flag
        alt Bias present
            Unroll->>Kernel: Add per-row bias
        else No bias
            Unroll->>Kernel: Pass NULL for C
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Dense kernel implementations: GELU_fp32_fp32_sigmoid_grad_chunk and LayernormGrad_fp32_fp32 require verification of gradient math formulas and numerical correctness
  • Multiple tiling constraints: GeluGradTileConstraint, LayernormGradTileConstraint, ReduceSumTileConstraint, and SoftmaxGradTileConstraint introduce new constraint logic requiring careful validation of dimension propagation and scheduling
  • Significant template refactoring: SGD and GEMM templates contain substantial loop unrolling and multi-path branching logic that needs verification for correctness and performance
  • End-to-end integration: Gradient support is wired across parsers → layers → mappers → tile constraints → templates → kernels, requiring tracing of data flow through multiple abstraction layers
  • Optional bias handling in GEMM: Conditional code paths and serialization logic in GEMMTileConstraint and FloatGemmTemplate require careful inspection for edge cases

Possibly related PRs

Suggested reviewers

  • Victor-Jung
  • Xeratec
  • lukamac

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title 'CCT Attention Training on Siracusa' is specific and directly reflects the main change: updates to the CCT training workflow on the Siracusa platform, evident from the CI workflow modifications and added training graph support.
  • Description check: ✅ Passed. The description comprehensively covers all major changes, including gradient operators, GEMM optimization, CCT training updates, and bug fixes, directly relating to the changeset across multiple files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Tip

✨ Issue Enrichment is now available for GitHub issues!

CodeRabbit can now help you manage issues more effectively:

  • Duplicate Detection — Identify similar or duplicate issues
  • Related Issues & PRs — Find relevant issues and PR's from your repository
  • Suggested Assignees — Find the best person to work on the issue
  • Implementation Planning — Generate detailed coding plans for engineers and agents
Disable automatic issue enrichment

To disable automatic issue enrichment, add the following to your .coderabbit.yaml:

issue_enrichment:
  auto_enrich:
    enabled: false

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (8)
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1)

26-42: Unrolling factor of 6 is functional but unconventional.

The loop unrolling works correctly. However, an unroll factor of 4 or 8 typically aligns better with SIMD widths and cache lines on most architectures. If performance is critical, consider benchmarking with power-of-2 unroll factors.
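
For reference, a minimal C sketch of the pattern under discussion: a 6-wide unrolled SGD update with a scalar tail loop. Function and variable names are illustrative only and do not correspond to the actual generated template code.

// Sketch: w -= lr * g, unrolled by 6, with a scalar tail for the remainder.
void sgd_update_unrolled_sketch(float *w, const float *g, float lr, int n) {
    int i = 0;
    const int n_main = n - (n % 6);  // elements covered by the unrolled loop
    for (; i < n_main; i += 6) {
        // Batch the multiplications into temporaries, then apply the updates.
        float m0 = lr * g[i], m1 = lr * g[i + 1], m2 = lr * g[i + 2];
        float m3 = lr * g[i + 3], m4 = lr * g[i + 4], m5 = lr * g[i + 5];
        w[i] -= m0; w[i + 1] -= m1; w[i + 2] -= m2;
        w[i + 3] -= m3; w[i + 4] -= m4; w[i + 5] -= m5;
    }
    for (; i < n; ++i) {  // tail elements not covered by the unroll
        w[i] -= lr * g[i];
    }
}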

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)

355-360: Consider adding strict=True to zip() calls for robustness.

While the lists are constructed together and should always match in length, adding strict=True (Python 3.10+) would catch any future bugs where list lengths diverge. This is optional given the current structure guarantees equal lengths.

         if has_bias:
-            for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes):
+            for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes, strict=True):
                 inputLoadSchedule.append({"A": a, "B": b, "C": c})
         else:
-            for a, b in zip(inputACubes, inputBCubes):
+            for a, b in zip(inputACubes, inputBCubes, strict=True):
                 inputLoadSchedule.append({"A": a, "B": b})
Deeploy/Targets/Generic/Layers.py (1)

453-457: Consider adding computeOps() method for consistency.

The LayerNormGradLayer class doesn't implement computeOps(), while other gradient layers like GELUGradLayer (Line 66) and SoftmaxGradLayer (Line 134) do. For consistency and accurate operation counting in analysis workflows, consider adding this method.

Would you like me to help estimate the operation count for LayerNorm gradient based on the kernel implementation in TargetLibraries/Generic/src/Layernorm_fp32.c?

Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)

113-121: Rename unused loop variables.

The loop variable dim is not used in the loop body on lines 113-117 and 118-121. Following Python conventions, rename it to _dim or simply remove it if only the index is needed.

Apply this diff:

-        for idx, dim in enumerate(input_shape):
+        for idx, _ in enumerate(input_shape):
             tilerModel.addConstraint(
                 tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
                 tilerModel.getTensorDimVar(tensorName = grad_in_buffer_name, dimIdx = idx))
 
-        for idx, dim in enumerate(input_shape):
+        for idx, _ in enumerate(input_shape):
             tilerModel.addConstraint(
                 tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
                 tilerModel.getTensorDimVar(tensorName = grad_out_buffer_name, dimIdx = idx))
TargetLibraries/Generic/src/Layernorm_fp32.c (1)

40-43: Unused bias parameter.

The bias parameter is declared but never used in the function body. LayerNorm gradient with respect to input typically doesn't require the bias term, but the signature inconsistency with the forward pass may cause confusion.

Consider either:

  1. Documenting why bias is intentionally unused (for API consistency with forward pass), or
  2. Removing it if not needed for gradient computation:
-void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
-                             float32_t *grad_out, float32_t *scale,
-                             float32_t *bias, float32_t epsilon, int32_t size,
-                             int32_t lastDimLength) {
+void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
+                             float32_t *grad_out, float32_t *scale,
+                             float32_t epsilon, int32_t size,
+                             int32_t lastDimLength) {
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (1)

16-17: Return type annotation inconsistency.

The method signature declares Tuple[NetworkContext, Dict, List[str]] but should return Tuple[NetworkContext, OperatorRepresentation, List[str]] to match the base class NodeTemplate.alignToContext signature shown in the relevant snippets.

     def alignToContext(self, ctxt: NetworkContext,
-                       operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, Dict, List[str]]:
+                       operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]:
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (2)

49-59: Consider extracting axis normalization into a helper method.

The negative axis normalization logic is duplicated across addGeometricalConstraint (lines 54-59 and 80-86), constructSymbolicNodeRep (lines 154-158), and serializeTilingSolution (lines 210-216). Extracting this into a shared static helper would reduce duplication and ensure consistent behavior.

@staticmethod
def _normalize_axes(axis, input_shape_len: int) -> List[int]:
    """Normalize axis indices, handling negative values."""
    if isinstance(axis, int):
        axis = [axis]
    normalized = []
    for ax in axis:
        if ax < 0:
            ax = input_shape_len + ax
        normalized.append(ax)
    return normalized

Also applies to: 75-86, 146-158, 210-216


237-238: Move import to top of file.

The HyperRectangle import should be placed at the top of the file with other imports from the same module, unless there's a specific reason to avoid it (e.g., circular dependency).

 from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import HyperRectangle

Or combine into a single import:

-from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, HyperRectangle, TilingSchedule, VariableReplacementScheme
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e07cd13 and 5502d04.

📒 Files selected for processing (23)
  • .github/workflows/ci-platform-siracusa-tiled.yml (3 hunks)
  • .github/workflows/ci-platform-siracusa.yml (0 hunks)
  • Deeploy/Targets/Generic/Layers.py (3 hunks)
  • Deeploy/Targets/Generic/Parsers.py (2 hunks)
  • Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1 hunks)
  • Deeploy/Targets/PULPOpen/Bindings.py (2 hunks)
  • Deeploy/Targets/PULPOpen/Platform.py (5 hunks)
  • Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1 hunks)
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3 hunks)
  • Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1 hunks)
  • Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1 hunks)
  • Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (5 hunks)
  • Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1 hunks)
  • Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1 hunks)
  • Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1 hunks)
  • Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1 hunks)
  • Deeploy/Targets/PULPOpen/Tiler.py (3 hunks)
  • Deeploy/TilingExtension/TilingCodegen.py (1 hunks)
  • TargetLibraries/Generic/inc/kernel/GELU.h (1 hunks)
  • TargetLibraries/Generic/inc/kernel/Layernorm.h (1 hunks)
  • TargetLibraries/Generic/src/GELU_fp32.c (1 hunks)
  • TargetLibraries/Generic/src/Layernorm_fp32.c (1 hunks)
  • TargetLibraries/PULPOpen/src/Gemm.c (1 hunks)
💤 Files with no reviewable changes (1)
  • .github/workflows/ci-platform-siracusa.yml
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
📚 Learning: 2025-09-09T15:43:20.195Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.

Applied to files:

  • Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py
  • Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
🧬 Code graph analysis (10)
TargetLibraries/Generic/inc/kernel/GELU.h (1)
TargetLibraries/Generic/src/GELU_fp32.c (1)
  • GELU_fp32_fp32_sigmoid_grad_chunk (34-53)
TargetLibraries/Generic/inc/kernel/Layernorm.h (1)
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
  • LayernormGrad_fp32_fp32 (40-93)
Deeploy/Targets/Generic/Layers.py (1)
Deeploy/DeeployTypes.py (2)
  • ONNXLayer (1819-2147)
  • computeOps (1835-1840)
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
Deeploy/DeeployTypes.py (1)
  • NodeTemplate (87-229)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (4)
Deeploy/DeeployTypes.py (1)
  • lookup (720-752)
Deeploy/TilingExtension/TilerModel.py (1)
  • getTensorDimVar (131-135)
Deeploy/TilingExtension/TilingCodegen.py (1)
  • HyperRectangle (24-35)
Deeploy/TilingExtension/TileConstraint.py (1)
  • extractBaseAddr (56-74)
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)
Deeploy/DeeployTypes.py (1)
  • NodeTemplate (87-229)
Deeploy/Targets/PULPOpen/Bindings.py (4)
Deeploy/DeeployTypes.py (1)
  • NodeBinding (1512-1657)
Deeploy/Targets/Generic/TypeCheckers.py (3)
  • AddChecker (88-102)
  • LayerNormChecker (198-209)
  • GELUChecker (388-402)
Deeploy/AbstractDataTypes.py (1)
  • PointerClass (536-559)
Deeploy/CommonExtensions/DataTypes.py (1)
  • float32_t (74-78)
Deeploy/Targets/PULPOpen/Tiler.py (4)
Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)
  • GeluGradTileConstraint (8-12)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (2)
  • LayernormTileConstraint (19-80)
  • LayernormGradTileConstraint (83-159)
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1)
  • ReduceSumTileConstraint (18-250)
Deeploy/TilingExtension/TilerExtension.py (1)
  • TilingReadyNodeBindings (1027-1035)
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3)
Deeploy/DeeployTypes.py (3)
  • NetworkContext (508-1020)
  • NodeTemplate (87-229)
  • alignToContext (119-139)
Deeploy/Targets/PULPOpen/Templates/ReshapeTemplate.py (1)
  • alignToContext (37-53)
Deeploy/Targets/PULPOpen/Templates/DMASliceTemplate.py (1)
  • alignToContext (18-138)
Deeploy/Targets/PULPOpen/Platform.py (3)
Deeploy/Targets/Generic/Layers.py (2)
  • GELUGradLayer (61-70)
  • LayerNormGradLayer (453-456)
Deeploy/Targets/Generic/Parsers.py (3)
  • GELUGradParser (773-797)
  • LayerNormGradParser (1677-1704)
  • LayerNormParser (1647-1674)
Deeploy/DeeployTypes.py (1)
  • NodeMapper (1660-1816)
🪛 Ruff (0.14.5)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py

113-113: Loop control variable dim not used within loop body

Rename unused dim to _dim

(B007)


118-118: Loop control variable dim not used within loop body

Rename unused dim to _dim

(B007)


128-128: Unused class method argument: ctxt

(ARG003)

Deeploy/Targets/Generic/Parsers.py

786-786: Unused method argument: channels_first

(ARG002)


1691-1691: Unused method argument: channels_first

(ARG002)

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py

290-290: Local variable varB is assigned to but never used

Remove assignment to unused variable varB

(F841)


356-356: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


359-359: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py

71-71: Avoid specifying long messages outside the exception class

(TRY003)


92-93: Avoid specifying long messages outside the exception class

(TRY003)


127-127: Unused static method argument: parseDict

(ARG004)


127-127: Unused static method argument: ctxt

(ARG004)


133-133: Unused static method argument: tilerModel

(ARG004)


172-172: Unused class method argument: absoluteOutputCubes

(ARG003)

Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py

159-159: Unused class method argument: ctxt

(ARG003)

🔇 Additional comments (39)
.github/workflows/ci-platform-siracusa-tiled.yml (3)

137-138: Verify intentionality of L1 configuration simplification for CCT tests.

The change reduces L1 memory configurations from multiple test values to a single value, which limits test coverage to one memory allocation strategy. According to the AI summary, this was [4000, 64000] and is now [64000]. Confirm this reduction is intentional and doesn't mask hardware-specific issues that only surface with smaller memory allocations.


168-171: Verify consistency of new test addition across all job configurations.

A new test entry testTrainCCT/CCT2_FT2 is added to the L3 jobs with L1: [128000]. Per the AI summary, this test should also be added to the singlebuffer-L2 configuration (line 138 area). Confirm whether this is intentional or if the test entry is missing from singlebuffer-L2 and other jobs.


208-211: Verify consistency of L1 simplification across doublebuffer-L3.

Changes here mirror those in singlebuffer-L3 (lines 169-171), which is good for consistency. However, apply the same verification from the previous comment to ensure testTrainCCT/CCT2_FT2 is correctly scoped to only L3 jobs and not missing from other configurations.

Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1)

679-680: LGTM! Naming collision fix improves correctness.

The updated naming scheme correctly prevents collisions when multiple transpose nodes feed into the same (or identically-named) downstream nodes. By prefixing with t1.name, each split transpose gets a unique identifier derived from both the source transpose and its consumer.

Deeploy/TilingExtension/TilingCodegen.py (1)

34-35: LGTM! Defensive tuple coercion improves robustness.

The normalization ensures offset and dims are always tuples internally, which is useful when callers pass lists or other sequences. Consider updating the type hints from Tuple[int, ...] to Sequence[int] to reflect the accepted input types more accurately.

TargetLibraries/Generic/inc/kernel/GELU.h (1)

28-30: LGTM! Function declaration is consistent with implementation.

The new gradient chunk function declaration aligns with the implementation in GELU_fp32.c and follows the existing naming conventions in this header.

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)

200-210: LGTM! Optional bias handling is correctly implemented.

The conditional bias detection and buffer handling ensures both biased and unbiased GEMM configurations are properly supported. The pattern is consistently applied in both addGeometricalConstraint and serializeTilingSolution.

Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)

40-51: LGTM! Kernel call with proper guard condition.

The check for elem_count > 0 correctly prevents calling the kernel with zero elements, which could occur for cores that have no work assigned when the sequence count is less than the core count.

Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)

8-12: LGTM!

The GeluGradTileConstraint class correctly extends BOPTileConstraint with appropriate gradient-specific tensor names (grad_in, data_in, grad_out), following the established pattern for binary operation tile constraints.

TargetLibraries/Generic/inc/kernel/Layernorm.h (1)

28-31: LGTM!

The function declaration for LayernormGrad_fp32_fp32 is correctly added, with a signature that mirrors the forward pass and properly includes gradient-specific parameters (grad_in, grad_out). The implementation exists in the corresponding .c file.

TargetLibraries/Generic/src/GELU_fp32.c (1)

34-53: LGTM!

The GELU gradient computation is mathematically correct. The function properly implements the derivative of the GELU approximation (x * sigmoid(1.702*x)) using the chain rule and applies the upstream gradient correctly.
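
For reference, the gradient being computed here, based on the sigmoid approximation of GELU:

\text{GELU}(x) \approx x\,\sigma(1.702x), \qquad
\frac{d\,\text{GELU}}{dx} = \sigma(1.702x) + 1.702\,x\,\sigma(1.702x)\bigl(1 - \sigma(1.702x)\bigr), \qquad
\text{grad\_out}_i = \text{grad\_in}_i \cdot \frac{d\,\text{GELU}}{dx}\Big|_{x_i}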

Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)

12-20: LGTM!

The gradient template correctly implements parallel chunking across cores, with proper bounds checking and invocation of the GELU gradient kernel. The chunk size calculation and boundary handling prevent out-of-bounds access.
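
A minimal sketch of the per-core chunking scheme described here; the helper name and signature are illustrative, not the template's actual code.

// Per-core range [start, stop) over `size` elements; cores beyond the work get an empty range.
static inline void core_chunk_sketch(int core_id, int num_cores, int size,
                                     int *start, int *stop) {
    int chunk = (size + num_cores - 1) / num_cores;          // ceil division
    *start = (chunk * core_id < size) ? chunk * core_id : size;
    *stop  = (*start + chunk < size) ? *start + chunk : size; // clamped, so no out-of-bounds access
}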

Deeploy/Targets/Generic/Layers.py (2)

61-71: LGTM!

The GELUGradLayer is properly implemented with a reasonable operation count estimate of 9 operations per element for the GELU gradient computation.


482-486: LGTM!

The computeOps() method correctly estimates 2 operations per element (one multiply and one add/subtract) for the SGD weight update step.
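
The count corresponds to the plain per-element SGD step, one multiplication and one subtraction:

w_i \leftarrow w_i - \eta\, g_i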

Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)

83-159: Verify tiling schedule correctness for gradient flow.

The LayernormGradTileConstraint implementation looks correct overall. However, ensure that the tiling schedule properly handles the gradient data flow:

  • Line 151: Both grad_in and data_in are loaded with the same cube dimensions
  • Line 153: grad_out uses the same cube dimensions as the inputs

This is consistent with the forward pass pattern and the gradient computation requirements.

Consider verifying with a test that exercises tiling for LayerNorm gradient to ensure the tile boundaries and data dependencies are correctly handled.

Deeploy/Targets/Generic/Parsers.py (2)

773-798: LGTM!

The GELUGradParser correctly parses GELU gradient nodes with proper input/output mapping. The parser validates 2 inputs (upstream gradient and forward input) and 1 output, which aligns with the gradient computation requirements.


1677-1705: LGTM!

The LayerNormGradParser correctly extends iLayerNormParser and validates 4 inputs (grad_in, data_in, weight, bias) and 1 output for the LayerNorm gradient operation. The size and lastDimLength computations are correct.

Deeploy/Targets/PULPOpen/Tiler.py (4)

18-25: LGTM!

The import updates correctly add gradient-specific bindings (GELUGrad, LayerNormGrad, SoftmaxGrad, SGD) needed for the new tiling ready bindings defined later in the file.


30-35: LGTM!

The tile constraint imports are properly added to support gradient operations (GeluGradTileConstraint, LayernormGradTileConstraint, SoftmaxGradTileConstraint, ReduceSumTileConstraint), enabling gradient-aware tiling throughout the pipeline.


124-125: LGTM!

The new gradient tiling ready bindings are correctly configured:

  • PULPLayernormGradTilingReadyBindings uses LayernormGradTileConstraint
  • PULPFPGELUGradTilingReadyBindings uses GeluGradTileConstraint

Both follow the established pattern and properly wire gradient constraints into the tiling framework.

Also applies to: 130-131


143-146: LGTM!

The updates to PULPSoftmaxGradTilingReadyBindings and PULPReduceSumTilingReadyBindings replace implicit UntiledTileConstraint with explicit gradient-specific constraints (SoftmaxGradTileConstraint and ReduceSumTileConstraint), improving tiling optimization for these operations.

TargetLibraries/PULPOpen/src/Gemm.c (5)

29-31: LGTM: Bias handling and unroll factor initialization.

The has_bias flag via NULL check and unroll factor computation are correctly implemented. The modulo-based tail calculation ensures correct handling of non-divisible dimensions.
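
A hedged sketch of this structure for the non-transposed case, showing the NULL-check bias flag and the modulo-based split of the output dimension into an unrolled range and a tail. All names (A, B, bias, M, N, O) are illustrative and do not match the kernel's actual signature; the bias is assumed here to be indexed per output column.

// Sketch only: optional bias via NULL check, 6-wide unroll over O with tail handling.
#include <stddef.h>

void gemm_bias_unroll_sketch(const float *A, const float *B, const float *bias,
                             float *Y, int M, int N, int O) {
    const int has_bias = (bias != NULL);
    const int o_tail = O % 6;            // columns left over after the unrolled loop
    const int o_main = O - o_tail;
    for (int m = 0; m < M; ++m) {
        for (int o = 0; o < o_main; o += 6) {
            float acc[6];
            for (int u = 0; u < 6; ++u)
                acc[u] = has_bias ? bias[o + u] : 0.0f;   // seed with bias only if present
            for (int n = 0; n < N; ++n)
                for (int u = 0; u < 6; ++u)
                    acc[u] += A[m * N + n] * B[n * O + o + u];
            for (int u = 0; u < 6; ++u)
                Y[m * O + o + u] = acc[u];
        }
        for (int o = o_main; o < O; ++o) {                // tail columns
            float acc = has_bias ? bias[o] : 0.0f;
            for (int n = 0; n < N; ++n)
                acc += A[m * N + n] * B[n * O + o];
            Y[m * O + o] = acc;
        }
    }
}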


33-83: LGTM: Non-transposed GEMM path with O-dimension unrolling.

The index calculations are correct for row-major layout. The 6-wide O-dimension unrolling with proper tail handling is well-implemented.


84-178: LGTM: Transposed-A GEMM path with dual-axis unrolling.

The transpose indexing pSrcA[k*M + i] correctly accesses column i of the original matrix. Both N and O dimension unrolling with tail handling are correctly implemented.
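
The index mapping being referred to, spelled out as a tiny self-contained helper; dimension names follow the review comment, and this is a sketch rather than the kernel's actual code.

// Offset of logical element A[i][k] of an M x N matrix A, depending on whether
// the buffer stores A row-major (transA == 0) or its transpose (transA == 1).
static inline int a_offset_sketch(int i, int k, int M, int N, int transA) {
    return transA ? (k * M + i) : (i * N + k);
}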


179-264: LGTM: Transposed-B GEMM path.

The b_row pointer setup for transposed B access is correct. The dual-axis unrolling efficiently processes 6 output columns while iterating over the reduction dimension.


265-353: LGTM: Both-transposed GEMM path.

The combination of transpose indexing for both matrices is correctly implemented, with consistent unrolling strategy across all code paths.

TargetLibraries/Generic/src/Layernorm_fp32.c (1)

72-90: Verify LayerNorm gradient formula correctness.

The implementation computes sum_dy as the unscaled sum of gradients (line 73), but the standard LayerNorm backward pass formula typically requires sum(dy * scale) for the mean subtraction term. The current formula:

grad_in[j] - (sum_dy / lastDimLength)

Should potentially be:

grad_in[j] * scale[j] - (sum_dy_scaled / lastDimLength)

where sum_dy_scaled = sum(dy * scale). The declared but unused sum_dy_scaled variable (line 45) suggests this may have been the original intent.

Please verify against the PyTorch or reference LayerNorm backward implementation to confirm the gradient computation is correct.
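
For reference, the standard per-row LayerNorm backward formula to check against, with normalized input \hat{x}, scale \gamma, upstream gradient dy, and D = lastDimLength:

g_j = dy_j\,\gamma_j, \qquad
\frac{\partial L}{\partial x_j} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}
\left( g_j - \frac{1}{D}\sum_{k=1}^{D} g_k - \hat{x}_j \cdot \frac{1}{D}\sum_{k=1}^{D} g_k \hat{x}_k \right)

That is, the upstream gradient is multiplied by the scale before the mean-subtraction terms, which matches the concern raised above.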

Deeploy/Targets/PULPOpen/Bindings.py (3)

423-429: LGTM: LayerNorm gradient binding.

The binding correctly specifies 4 float32 inputs (grad_in, data_in, weight, bias) and 1 output, matching the LayerNormGradParser signature. The use of ForkTransformer is consistent with other PULP bindings.


435-437: LGTM: GELU gradient binding.

The gradient binding mirrors the forward PULPFloatGELUBinding structure with 2 inputs and 1 output, appropriately using referenceGradTemplate.


191-196: I'll help you verify the 3-input Add template compatibility concern. Let me start by locating the FloatAddTemplate definition and examining its implementation.
#!/bin/bash
# Locate FloatAddTemplate and its referenceTemplate
rg -n "class FloatAddTemplate|FloatAddTemplate.referenceTemplate" --type py -A 5 -B 2

#!/bin/bash
# Search for FloatAddTemplate definition across the codebase
fd . -type f -name "*.py" | xargs rg -l "FloatAddTemplate" | head -20

#!/bin/bash
# Find the file containing FloatAddTemplate definition
rg -n "class FloatAddTemplate" --type py

Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (2)

19-25: LGTM: Optional bias handling in alignToContext.

The method correctly handles the no-bias case by setting sensible defaults. The C_type default to float32_tPtr ensures the generated function signature remains valid even when C is NULL.


28-79: LGTM: GEMM template with conditional bias handling.

The template correctly handles both biased and bias-free GEMM cases with appropriate conditional code generation. The batch pointer arithmetic is correctly guarded by both C is not None and C_batched conditions.

Deeploy/Targets/PULPOpen/Platform.py (3)

59-59: LGTM: GELUGradMapper definition.

The mapper correctly wires GELUGradParser to PULPFPGELUGradTilingReadyBindings, following the established pattern of other mappers in this file.


89-89: LGTM: LayerNormGradMapper definition.

The mapper correctly connects LayerNormGradParser to PULPLayernormGradTilingReadyBindings.


118-120: LGTM: Gradient operation mapping entries.

The GeluGrad and LayerNormalizationGrad entries are correctly added to PULPMapping, following the same pattern as existing gradient operations like SoftmaxGrad.

Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1)

106-188: LGTM!

The SoftmaxGradTileConstraint implementation follows the same well-established pattern as iSoftmaxTileConstraint. The three-tensor constraint logic in addGeometricalConstraint correctly ensures dimensional consistency across gradient inputs and outputs, and the serialization properly maps the input/output load schedules.

Regarding the static analysis hint about ctxt being unused on line 159: this parameter is part of the TileConstraint interface signature and must be maintained for API consistency across all constraint implementations.

Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (3)

126-130: LGTM!

The empty implementation with explanatory comment is acceptable. The parent class interface requires this method signature, and deferring policy decisions to serializeTilingSolution is a valid design choice for ReduceSum operations.


132-168: LGTM!

The method correctly constructs the symbolic representation with normalized axes and keepdims handling. The unused tilerModel parameter is an interface requirement from the parent TileConstraint class.


170-175: I'll help you verify this review comment. Let me start by examining how absoluteOutputCubes is used across different tile constraint classes and then look at the specific implementation in ReduceSumTileConstraint.
#!/bin/bash
# Search for absoluteOutputCubes usage in all TileConstraint files
rg -n "absoluteOutputCubes" Deeploy/Targets/PULPOpen/TileConstraints/ -A 5 -B 2

Comment on lines +36 to +38
const float* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
const float* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
float* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
@coderabbitai coderabbitai bot commented Nov 26, 2025

⚠️ Potential issue | 🟡 Minor

Type inconsistency: hardcoded float* vs templated kernel type.

The pointer declarations use hardcoded const float* and float*, but the kernel call uses ${grad_in_type.referencedType.typeWidth} for type selection. If this template is used with non-fp32 types, there will be a type mismatch.

Consider using the template type for consistency:

-const float* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
-const float* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
-float* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
+const float${grad_in_type.referencedType.typeWidth}_t* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
+const float${data_in_type.referencedType.typeWidth}_t* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
+float${grad_out_type.referencedType.typeWidth}_t* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const float* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
const float* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
float* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
const float${grad_in_type.referencedType.typeWidth}_t* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
const float${data_in_type.referencedType.typeWidth}_t* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
float${grad_out_type.referencedType.typeWidth}_t* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
🤖 Prompt for AI Agents
In Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py around lines
36-38, the pointer declarations are hardcoded to float (const float* / float*)
which mismatches the templated kernel types; replace those hardcoded types with
the corresponding template types so pointers use the same referenced type as the
kernel: use ${grad_in_type.referencedType.type} for ${nodeName}_grad_in_ptr,
${data_in_type.referencedType.type} for ${nodeName}_data_in_ptr, and
${grad_out_type.referencedType.type} for ${nodeName}_grad_out_ptr so the pointer
types stay consistent with the kernel's instantiated type.

Member

I think this is a valid comment to address. We should always use the data types with the explicit bitwidth.

Contributor

@Xeratec, thank you for confirming! You're absolutely right – using explicit bitwidth types ensures type safety and consistency across the template.

Here's the corrected code for lines 36-38:

const float${grad_in_type.referencedType.typeWidth}_t* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
const float${data_in_type.referencedType.typeWidth}_t* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
float${grad_out_type.referencedType.typeWidth}_t* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;

This ensures the pointer types match the templated kernel types (e.g., float32_t*, float16_t*) rather than being hardcoded to float*, maintaining consistency throughout the template instantiation.


✏️ Learnings added
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 69
File: Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py:36-38
Timestamp: 2025-12-02T13:54:22.680Z
Learning: In Deeploy templates (Python files in Deeploy/Targets/PULPOpen/Templates/), always use explicit bitwidth types (e.g., `float${...type.referencedType.typeWidth}_t*`) instead of hardcoded types (e.g., `float*`) to ensure type consistency with templated kernel calls.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +12 to +14
int16_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
int16_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
int16_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
@coderabbitai coderabbitai bot commented Nov 26, 2025

⚠️ Potential issue | 🟠 Major

Potential overflow: int16_t may be insufficient for large tensors.

chunk, chunk_start, and chunk_stop are declared as int16_t, limiting ${size} to a maximum of 32,767 elements. For larger tensors, these variables will overflow, causing incorrect chunking and potential out-of-bounds access. Consider using int32_t or uint32_t for these variables.

-int16_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
-int16_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
-int16_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
+int32_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
+int32_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
+int32_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
int16_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
int16_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
int16_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
int32_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
int32_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
int32_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
🤖 Prompt for AI Agents
In Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py around lines 12 to 14, the
variables `${nodeName}_chunk`, `${nodeName}_chunk_start` and
`${nodeName}_chunk_stop` are declared as int16_t which can overflow for sizes
>32,767; change their type to a 32-bit integer (e.g., int32_t or uint32_t) and
ensure the arithmetic uses that wider type (cast operands if necessary) so the
right-shift, multiplication and MIN operations are performed in 32-bit space and
prevent out‑of‑bounds chunking for large tensors.

Member

I think this is also a valid comment.

Contributor

@Xeratec, thanks for confirming! You're right—this is an important fix. Using int32_t instead of int16_t for the chunk variables will prevent overflow issues when training with larger tensors (anything over 32K elements), ensuring the parallel work distribution remains correct across all cores.

Comment on lines +289 to +295
         varA = operatorRepresentation['A']
         varB = operatorRepresentation['B']

         if transA == 0:
-            NSize = buffA.shape[-1]
+            NSize = ctxt.lookup(varA).shape[-1]
         else:
-            NSize = buffA.shape[-2]
+            NSize = ctxt.lookup(varA).shape[-2]
Copy link
Contributor

@coderabbitai coderabbitai bot Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Remove unused varB assignment.

varB is assigned on line 290 but never used, as confirmed by static analysis. Remove this dead code.

         varA = operatorRepresentation['A']
-        varB = operatorRepresentation['B']

         if transA == 0:
             NSize = ctxt.lookup(varA).shape[-1]
         else:
             NSize = ctxt.lookup(varA).shape[-2]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
         varA = operatorRepresentation['A']
         varB = operatorRepresentation['B']
         if transA == 0:
-            NSize = buffA.shape[-1]
+            NSize = ctxt.lookup(varA).shape[-1]
         else:
-            NSize = buffA.shape[-2]
+            NSize = ctxt.lookup(varA).shape[-2]

         varA = operatorRepresentation['A']
         if transA == 0:
             NSize = ctxt.lookup(varA).shape[-1]
         else:
             NSize = ctxt.lookup(varA).shape[-2]
🧰 Tools
🪛 Ruff (0.14.5)

290-290: Local variable varB is assigned to but never used

Remove assignment to unused variable varB

(F841)

🤖 Prompt for AI Agents
In Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py around lines
289 to 295, the variable varB is assigned from operatorRepresentation['B'] but
never used; remove the unused varB assignment so only varA is read (keep varA =
operatorRepresentation['A']), and leave the subsequent logic that computes NSize
based on transA and ctxt.lookup(varA).shape intact.

Member

Valid comment.

Contributor

@Xeratec, thank you for confirming! Feel free to apply the suggestion when you're ready to remove the unused varB assignment.


🧠 Learnings used
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.

@Xeratec Xeratec left a comment

Thanks for your work!
Looks already very good. I still have some small comments, but nothing major.

Comment on lines +482 to +486
    def computeOps(self):

        size = self.mapper.parser.operatorRepresentation['size']
        return size * 2

Member

I would not consider these as actual operations. I would argue that to compute Op/s, we should only consider arithmetic operations and transpositions, and thus count zero here.

Comment on lines +191 to +195
] + [
    NodeBinding(
        AddChecker([PointerClass(float32_t), PointerClass(float32_t),
                    PointerClass(float32_t)], [PointerClass(float32_t)]), FloatAddTemplate.referenceTemplate,
        ForkTransformer)
Member

Is this 3-input adder tested and actually supported by the FloatAddTemplate?

Comment on lines +36 to +38
const float* ${nodeName}_grad_in_ptr = ${grad_in} + ${nodeName}_elem_start;
const float* ${nodeName}_data_in_ptr = ${data_in} + ${nodeName}_elem_start;
float* ${nodeName}_grad_out_ptr = ${grad_out} + ${nodeName}_elem_start;
Member

I think this is a valid comment to address. We should always use the data types with the explicit bitwidth.

Comment on lines +12 to +14
int16_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0);
int16_t ${nodeName}_chunk_start = MIN(${nodeName}_chunk*${nodeName}_core_id, ${size});
int16_t ${nodeName}_chunk_stop = MIN(${nodeName}_chunk_start + ${nodeName}_chunk, ${size});
Member

I think this is also a valid comment.

Member

@diaconuccalin Would you mind reviewing this file, as you are now very familiar with the GEMMTileConstraint.py and were the last person to change it?

Member

@runwangdl I do not see a reason to remove the comments added by Calin. I believe it makes the code easier to read.

Comment on lines +289 to +295
         varA = operatorRepresentation['A']
         varB = operatorRepresentation['B']

         if transA == 0:
-            NSize = buffA.shape[-1]
+            NSize = ctxt.lookup(varA).shape[-1]
         else:
-            NSize = buffA.shape[-2]
+            NSize = ctxt.lookup(varA).shape[-2]
Member

Valid comment.


Labels

Feature: Addition of new features

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

3 participants