Commit 4321c6d

author
Raphael
committed
Merge everything relevant to the GEMM / MHA from snitch into my branch
Squashed commit of the following commits:

* commit 8fd7a66 - Fix tracing
* commit e27b57e - sw: Use DataGen class in FlashAttention-2 and FusedConcatLinear data generators
* commit cfee4e1 - sw: Add MHA kernel
* commit e040704 - sw: Enable GEMM parallelized over K on subset of clusters
* commit 43b8dd8 - target: Separate HAL source and build dirs
* commit 16b74ea - target: Add missing RDL files to clean targets
* commit 4d2b312 - flashattention_2: Fix to work on multi-cluster systems
* commit 2a9536f - docs: Add system integration page
* commit 5769bbd - sw: Make `snitch_cluster_cfg.h.tpl` depend only on config
* commit c937cfc - docs: Add system integration guide
* commit 5fcf257 - runtime: Fix global reduction with DMA
* commit 1650654 - target: Streamline `SNRT_APPS` integration in derived systems
* commit bc60d21 - target: Pick up CLI gentrace flags and set `--permissive` when debugging
* commit 155f764 - target: Update trace visualization command after `SN_CFG` name change
* commit 3720c55 - sw: Add multicast 2D tile transfer functions
* commit d17d87c - sw: Enable overriding scripts directory
* commit 6b75d99 - runtime: Fix CLS pointer initialization
* commit 9528b4a - runtime: Fix `snrt_wake_up` with fence
* commit 8d73450 - runtime: Add `snrt_fence` routine
* commit 7f430f2 - Expose multiple wide TCDM ports (#258)
1 parent 9459607 commit 4321c6d

32 files changed

+1130
-392
lines changed

docs/ug/system_integration.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+# Integrating the Snitch cluster in an SoC
+
+While this repository provides many IPs that can be reused independently, we suggest integrating the Snitch cluster as a whole, that is the `snitch_cluster` module, in derived systems.
+
+The `snitch_cluster` module is implemented in [snitch_cluster.sv](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/hw/snitch_cluster/src/snitch_cluster.sv).
+
+## Configurability
+
+A reference instantiation of the Snitch cluster can be found in the testbench used to test the cluster within this repository, see [testharness.sv](https://github.com/pulp-platform/{{ repo }}/blob/{{ branch }}/target/snitch_cluster/test/testharness.sv).
+
+As you may note, we do not instantiate the `snitch_cluster` directly but a so-called `snitch_cluster_wrapper`, with a much simplified interface. All parameters of the `snitch_cluster` module are set within the wrapper.
+
+The benefit of the wrapper is that it can be programmatically generated from a single source of truth, namely a JSON5 configuration file, from which the software hardware-abstraction layer (HAL), and all other sources dependent on the configuration within the repository, are also generated.
+
+This way, if you want to modify the cluster configuration, you don't have to go and manually change it in multiple places (the RTL, the HAL, etc.), but only in the single-source-of-truth cluster configuration file. More information on the configuration file can be found in the [tutorial](tutorial.md#configuring-the-hardware).
+
+We suggest using the same approach when integrating the Snitch cluster in an SoC. This allows you to easily test different configurations of the cluster inside your SoC.
+
+## Integrating the RTL
+
+We provide Make rules to generate the cluster wrapper and other RTL files. Include the following lines in a Makefile, to inherit Snitch's rules:
+```Makefile
+SN_ROOT = $(shell $(BENDER) path snitch_cluster)
+
+include $(SN_ROOT)/target/common/common.mk
+include $(SN_ROOT)/target/common/rtl.mk
+```
+
+!!! note
+    Snitch's Makefiles require `SN_ROOT` to be defined and to point to the root of the Snitch cluster repository. You can set this however you prefer, i.e. you don't have to use Bender if you manage your dependencies in a different way.
+
+You can then use the `sn-rtl` and `sn-clean-rtl` targets to respectively build and clean all of Snitch's generated RTL sources.
+<!-- TODO(colluca): In Picobello we explicitly use the $(SN_CLUSTER_WRAPPER) $(SN_CLUSTER_PKG) variables to build only the generated sources that depend on the cluster config. Find a common ground, probably define targets for only those files. -->
+
+## Integrating the software
+
+Similarly, Snitch comes with a collection of software tests and applications. These build on the functions provided by the Snitch runtime library, so they must be linked against an implementation of the latter. The runtime library abstracts away all the low-level characteristics of the system, allowing applications to be written in a mostly system-independent way, and to be portable to any multi-cluster Snitch-based system.
+To this end, every system must implement a hardware abstraction layer (HAL) for the Snitch runtime, which the mentioned infrastructure builds on.
+
+Given a path to the platform-specific HAL sources, you can reuse the Snitch cluster's Make rules to build the runtime, tests and applications for the target platform.
+Include the following lines in a Makefile, to inherit Snitch's rules:
+
+```Makefile
+SN_RUNTIME_HAL_DIR = sw/runtime/hal
+
+include $(SN_ROOT)/target/common/sw.mk
+```
+
+The included Makefile(s) can be customized to some extent by overriding some variables before the Makefile inclusion line.
+For example, setting `SNRT_BUILD_APPS = OFF` prevents any of the default Snitch applications from being built.
+You can explicitly set the list of applications to be built via the `SNRT_APPS` variable, which can include additional system-dependent applications you may develop in the system repository. For further information on the available customization options, you may want to take a look inside the recursively included Makefiles.

hw/snitch_cluster/src/snitch_cluster.sv

Lines changed: 32 additions & 30 deletions
@@ -68,6 +68,8 @@ module snitch_cluster
   parameter int unsigned DMAReqFifoDepth = 3,
   /// Number of DMA channels.
   parameter int unsigned DMANumChannels = 1,
+  /// Number of exposed TCDM wide ports
+  parameter int unsigned NumExpWideTcdmPorts = 1,
   /// Width of a single icache line.
   parameter int unsigned ICacheLineWidth [NrHives] = '{default: 0},
   /// Number of icache lines per set.
@@ -235,70 +237,70 @@ module snitch_cluster
 ) (
   /// System clock. If `IsoCrossing` is enabled this port is the _fast_ clock.
   /// The slower, half-frequency clock, is derived internally.
-  input logic clk_i,
+  input logic clk_i,
   /// Asynchronous active high reset. This signal is assumed to be _async_.
-  input logic rst_ni,
+  input logic rst_ni,
   /// Per-core debug request signal. Asserting this signals puts the
   /// corresponding core into debug mode. This signal is assumed to be _async_.
-  input logic [NrCores-1:0] debug_req_i,
+  input logic [NrCores-1:0] debug_req_i,
   /// Machine external interrupt pending. Usually those interrupts come from a
   /// platform-level interrupt controller. This signal is assumed to be _async_.
-  input logic [NrCores-1:0] meip_i,
+  input logic [NrCores-1:0] meip_i,
   /// Machine timer interrupt pending. Usually those interrupts come from a
   /// core-local interrupt controller such as a timer/RTC. This signal is
   /// assumed to be _async_.
-  input logic [NrCores-1:0] mtip_i,
+  input logic [NrCores-1:0] mtip_i,
   /// Core software interrupt pending. Usually those interrupts come from
   /// another core to facilitate inter-processor-interrupts. This signal is
   /// assumed to be _async_.
-  input logic [NrCores-1:0] msip_i,
+  input logic [NrCores-1:0] msip_i,
   // External interrupt pending.
-  input logic [NrCores-1:0] mxip_i,
+  input logic [NrCores-1:0] mxip_i,
   /// First hartid of the cluster. Cores of a cluster are monotonically
   /// increasing without a gap, i.e., a cluster with 8 cores and a
   /// `hart_base_id_i` of 5 get the hartids 5 - 12.
-  input logic [9:0] hart_base_id_i,
+  input logic [9:0] hart_base_id_i,
   /// Base address of cluster. TCDM and cluster peripheral location are derived from
   /// it. This signal is pseudo-static.
-  input logic [PhysicalAddrWidth-1:0] cluster_base_addr_i,
+  input logic [PhysicalAddrWidth-1:0] cluster_base_addr_i,
   /// Configuration inputs for the memory cuts used in implementation.
   /// These signals are pseudo-static.
-  input sram_cfgs_t sram_cfgs_i,
+  input sram_cfgs_t sram_cfgs_i,
   /// Bypass half-frequency clock. (`d2` = divide-by-two). This signal is
   /// pseudo-static.
-  input logic clk_d2_bypass_i,
+  input logic clk_d2_bypass_i,
   /// AXI Core cluster in-port.
-  input narrow_in_req_t narrow_in_req_i,
-  output narrow_in_resp_t narrow_in_resp_o,
+  input narrow_in_req_t narrow_in_req_i,
+  output narrow_in_resp_t narrow_in_resp_o,
   /// AXI Core cluster out-port.
-  output narrow_out_req_t narrow_out_req_o,
-  input narrow_out_resp_t narrow_out_resp_i,
+  output narrow_out_req_t narrow_out_req_o,
+  input narrow_out_resp_t narrow_out_resp_i,
   /// AXI DMA cluster out-port. Usually wider than the cluster ports so that the
   /// DMA engine can efficiently transfer bulk of data.
-  output wide_out_req_t wide_out_req_o,
-  input wide_out_resp_t wide_out_resp_i,
+  output wide_out_req_t wide_out_req_o,
+  input wide_out_resp_t wide_out_resp_i,
   /// AXI DMA cluster in-port.
-  input wide_in_req_t wide_in_req_i,
-  output wide_in_resp_t wide_in_resp_o,
+  input wide_in_req_t wide_in_req_i,
+  output wide_in_resp_t wide_in_resp_o,
   // An additional AXI Core cluster out-port, used e.g. to connect
   // to the configuration interface of an external accelerator.
   // Compared to the `narrow_out` interface, the address space of
   // this port extends the cluster address space. We refer to the prior
   // as an external AXI plug, and to this as an externally-exposed
   // internal AXI plug.
-  output narrow_out_req_t narrow_ext_req_o,
-  input narrow_out_resp_t narrow_ext_resp_i,
+  output narrow_out_req_t narrow_ext_req_o,
+  input narrow_out_resp_t narrow_ext_resp_i,
   // External TCDM ports
-  input tcdm_dma_req_t tcdm_ext_req_i,
-  output tcdm_dma_rsp_t tcdm_ext_resp_o,
+  input tcdm_dma_req_t [NumExpWideTcdmPorts-1:0] tcdm_ext_req_i,
+  output tcdm_dma_rsp_t [NumExpWideTcdmPorts-1:0] tcdm_ext_resp_o,
   /// DCA IF to the FPU's
-  input dca_router_req_t dca_8x_req_i,
-  input logic dca_8x_req_valid_i,
-  output logic dca_8x_req_ready_o,
+  input dca_router_req_t dca_8x_req_i,
+  input logic dca_8x_req_valid_i,
+  output logic dca_8x_req_ready_o,
   /// DCA IF from the FPU's
-  output dca_router_resp_t dca_8x_resp_o,
-  output logic dca_8x_resp_valid_o,
-  input logic dca_8x_resp_ready_i
+  output dca_router_resp_t dca_8x_resp_o,
+  output logic dca_8x_resp_valid_o,
+  input logic dca_8x_resp_ready_i
 );
   // ---------
   // Constants
@@ -907,7 +909,7 @@ module snitch_cluster
   );

   snitch_tcdm_interconnect #(
-    .NumInp (1),
+    .NumInp (NumExpWideTcdmPorts),
     .NumOut (NrSuperBanks),
     .NumHyperBanks (NrHyperBanks),
     .tcdm_req_t (tcdm_dma_req_t),

hw/snitch_cluster/src/snitch_cluster_wrapper.sv.tpl

Lines changed: 13 additions & 7 deletions
@@ -28,6 +28,12 @@ ${int(getattr(c['isa_parsed'], isa))}\
 % endfor
 </%def>\

+<%
+  actual_num_exposed_wide_tcdm_ports = cfg['cluster']['num_exposed_wide_tcdm_ports']
+  if actual_num_exposed_wide_tcdm_ports == 0:
+    actual_num_exposed_wide_tcdm_ports += 1
+%>
+
 module ${cfg['cluster']['name']}_wrapper (
   input logic clk_i,
   input logic rst_ni,
@@ -50,8 +56,8 @@ module ${cfg['cluster']['name']}_wrapper (
   output ${cfg['cluster']['name']}_pkg::wide_in_resp_t wide_in_resp_o,
   output ${cfg['cluster']['name']}_pkg::narrow_out_req_t narrow_ext_req_o,
   input ${cfg['cluster']['name']}_pkg::narrow_out_resp_t narrow_ext_resp_i,
-  input ${cfg['cluster']['name']}_pkg::tcdm_dma_req_t tcdm_ext_req_i,
-  output ${cfg['cluster']['name']}_pkg::tcdm_dma_rsp_t tcdm_ext_resp_o,
+  input ${cfg['cluster']['name']}_pkg::tcdm_dma_req_t [${actual_num_exposed_wide_tcdm_ports}-1:0] tcdm_ext_req_i,
+  output ${cfg['cluster']['name']}_pkg::tcdm_dma_rsp_t [${actual_num_exposed_wide_tcdm_ports}-1:0] tcdm_ext_resp_o,
   input ${cfg['cluster']['name']}_pkg::dca_router_req_t dca_8x_req_i,
   input logic dca_8x_req_valid_i,
   output logic dca_8x_req_ready_o,
@@ -106,6 +112,7 @@ module ${cfg['cluster']['name']}_wrapper (
   .DMANumAxInFlight (${cfg['cluster']['dma_axi_req_fifo_depth']}),
   .DMAReqFifoDepth (${cfg['cluster']['dma_req_fifo_depth']}),
   .DMANumChannels (${cfg['cluster']['dma_nr_channels']}),
+  .NumExpWideTcdmPorts (${actual_num_exposed_wide_tcdm_ports}),
   .ICacheLineWidth (${cfg['cluster']['name']}_pkg::ICacheLineWidth),
   .ICacheLineCount (${cfg['cluster']['name']}_pkg::ICacheLineCount),
   .ICacheWays (${cfg['cluster']['name']}_pkg::ICacheWays),
@@ -217,13 +224,12 @@ module ${cfg['cluster']['name']}_wrapper (
   .narrow_ext_req_o (narrow_ext_req_o),
   .narrow_ext_resp_i (${cfg['cluster']['name']}_pkg::narrow_out_resp_t'('0)),
 % endif
-% if cfg['cluster']['wide_tcdm_port_expose']:
-  .tcdm_ext_req_i (tcdm_ext_req_i),
-  .tcdm_ext_resp_o (tcdm_ext_resp_o),
-% else:
+% if cfg['cluster']['num_exposed_wide_tcdm_ports']==0:
   .tcdm_ext_req_i (${cfg['cluster']['name']}_pkg::tcdm_dma_req_t'('0)),
-  .tcdm_ext_resp_o (tcdm_ext_resp_o),
+% else:
+  .tcdm_ext_req_i (tcdm_ext_req_i),
 % endif
+  .tcdm_ext_resp_o (tcdm_ext_resp_o),
   .narrow_in_req_i,
   .narrow_in_resp_o,
   .narrow_out_req_o,
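
The port-count handling in the template hunks above can be modelled in a few lines of C. This is only an illustrative sketch of the template logic as it reads in this diff: `wide_tcdm_ports_t` and `resolve_wide_tcdm_ports` are hypothetical names that appear neither in the template nor elsewhere in the Snitch sources, and `cfg_num_ports` stands in for `cfg['cluster']['num_exposed_wide_tcdm_ports']`.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical summary of what the wrapper template emits (illustration only).
typedef struct {
    uint32_t port_array_width;  // value passed as NumExpWideTcdmPorts
    bool request_tied_off;      // true when tcdm_ext_req_i is driven with '0
} wide_tcdm_ports_t;

// Models the Mako logic: a configured count of 0 is bumped to 1 so the
// packed port arrays never get a zero width, and in that case the cluster's
// request input is tied to zero instead of being connected to the wrapper port.
static wide_tcdm_ports_t resolve_wide_tcdm_ports(uint32_t cfg_num_ports) {
    wide_tcdm_ports_t r;
    r.port_array_width = (cfg_num_ports == 0) ? 1 : cfg_num_ports;
    r.request_tied_off = (cfg_num_ports == 0);
    return r;
}
```

Under this model, a configuration of 0 and a configuration of 1 both yield a one-element port array; they differ only in whether the request input is connected.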

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@ nav:
 - Advanced:
 - Trace Analysis: ug/trace_analysis.md
 - Code Optimization: ug/code_optimization.md
+- System Integration: ug/system_integration.md
 - Documentation: ug/documentation.md
 - Reference Manual:
 - Hardware:

sw/apps/common.mk

Lines changed: 5 additions & 5 deletions
@@ -5,12 +5,12 @@
 # Luca Colagrande <[email protected]>

 DATA_DIR := $(realpath $(SRC_DIR)/../data)
-SCRIPTS_DIR := $(realpath $(SRC_DIR)/../scripts)

-$(APP)_DATA_CFG ?= $(DATA_DIR)/params.json
-SECTION ?=
-DATA_H := $($(APP)_BUILD_DIR)/data.h
-DATAGEN_PY = $(SCRIPTS_DIR)/datagen.py
+$(APP)_SCRIPT_DIR ?= $(realpath $(SRC_DIR)/../scripts)
+$(APP)_DATA_CFG ?= $(DATA_DIR)/params.json
+SECTION ?=
+DATA_H := $($(APP)_BUILD_DIR)/data.h
+DATAGEN_PY := $($(APP)_SCRIPT_DIR)/datagen.py

 $(APP)_HEADERS := $(DATA_H)
 $(APP)_INCDIRS += $(dir $(DATA_H)) $(SRC_DIR)

sw/blas/gemm/src/gemm.h

Lines changed: 10 additions & 3 deletions
@@ -175,7 +175,7 @@ static inline uint32_t calculate_partitioned_banks_stride(
  * 3. Allocates space in TCDM for local copies of matrix tiles, unless
  *    matrix tiles are already stored in TCDM (see `load_* arguments`).
  * 4. Distributes tiles to clusters for parallel processing.
- * 5. Iterates over the tiles, performing the following:
+ * 5. Each cluster iterates over the assigned tiles, performing the following:
  *    - Copies data for the current tile into local memory.
  *    - Performs the tile computation using the `sc_st_gemm` function.
  *    - Performs a logarithmic reduction to combine partial results across
@@ -226,8 +226,15 @@ static inline int gemm(const gemm_args_t *args) {
     // Distribute m and k tiles to clusters
     uint32_t cluster_m_tiles = largs->m_tiles;
     uint32_t cluster_k_tiles = largs->k_tiles;
+    uint32_t num_working_clusters = snrt_cluster_num();
     if (largs->parallelize_m) cluster_m_tiles /= snrt_cluster_num();
-    if (largs->parallelize_k) cluster_k_tiles /= snrt_cluster_num();
+    if (largs->parallelize_k) {
+        uint32_t k_tiles_quotient = cluster_k_tiles / snrt_cluster_num();
+        uint32_t k_tiles_remainder = cluster_k_tiles % snrt_cluster_num();
+        cluster_k_tiles = k_tiles_quotient;
+        if (snrt_cluster_idx() < k_tiles_remainder) cluster_k_tiles++;
+        if (k_tiles_quotient == 0) num_working_clusters = k_tiles_remainder;
+    }

     // Calculate number of iterations
     uint32_t num_tiles = cluster_m_tiles * largs->n_tiles * cluster_k_tiles;
@@ -456,7 +463,7 @@ static inline int gemm(const gemm_args_t *args) {
             // Note: both compute and DMA cores participate in this step.
             if (largs->parallelize_k && (comp_k == (cluster_k_tiles - 1))) {
                 snrt_global_reduction_dma(
-                    (double *)lcr, (double *)lc[c_buff_idx], tile_m * tile_n);
+                    (double *)lcr, (double *)lc[c_buff_idx], tile_m * tile_n, num_working_clusters);
             }
         }

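
The K-tile distribution introduced in the `gemm()` hunk above can be tried in isolation with the sketch below. It mirrors the arithmetic of the diff but takes the cluster count and cluster index as explicit arguments rather than reading them from `snrt_cluster_num()` and `snrt_cluster_idx()`. The names `k_split_t` and `split_k_tiles` are hypothetical and are not part of `gemm.h`.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical result type (illustration only, not part of gemm.h).
typedef struct {
    uint32_t k_tiles;               // K tiles assigned to this cluster
    uint32_t num_working_clusters;  // cluster count passed to the reduction
} k_split_t;

// Same arithmetic as the parallelize_k branch in the diff: every cluster
// gets the quotient, the first `remainder` clusters get one extra tile, and
// when there are fewer tiles than clusters only `remainder` clusters work.
static k_split_t split_k_tiles(uint32_t k_tiles, uint32_t num_clusters,
                               uint32_t cluster_idx) {
    assert(num_clusters > 0 && cluster_idx < num_clusters);
    uint32_t quotient = k_tiles / num_clusters;
    uint32_t remainder = k_tiles % num_clusters;
    k_split_t s = {quotient, num_clusters};
    if (cluster_idx < remainder) s.k_tiles++;
    if (quotient == 0) s.num_working_clusters = remainder;
    return s;
}
```

For example, with 10 K tiles on 4 clusters, clusters 0 and 1 receive 3 tiles each and clusters 2 and 3 receive 2 each, and all 4 take part in the reduction. With 3 K tiles on 4 clusters, clusters 0 to 2 receive one tile each, cluster 3 receives none, and the reduction is sized for 3 clusters.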
