
Introduce Mixture-of-Experts (MoE) MLP to the ForecastingEngine for extreme precipitation forecasting #1000

@wael-mika

Description


Is your feature request related to a problem? Please describe.

I would like to explore whether Mixture-of-Experts (MoE) layers can improve the performance of the WeatherGenerator model when predicting precipitation from IMERG and ERA5.
Currently, the model uses standard dense MLP blocks inside the ForecastingEngine and other engines. These dense MLPs must fit all dynamical regimes with a single set of parameters, which may limit their ability to capture diverse atmospheric behaviors (e.g., blocking vs zonal flow, tropical convection vs extratropics).

The idea is to replace the final MLP block in the ForecastingEngine (and potentially in other engines later) with an MoE-like layer that could improve performance while keeping the code changes minimal for experimentation purposes.

Describe the solution you'd like

Introduce an MoE MLP block as a drop-in replacement for the current MLP (a minimal sketch is given after this list).

It should preserve the same interface (forward(*args)) so it can be used in existing engines with minimal changes.

The MoE version will use a lightweight top-k router and multiple experts (num_experts, top_k configurable).

Each expert is a small FFN (mirroring the existing _DenseBlock), and the router determines which experts contribute per token.

Add optional tracking for per-expert usage, entropy, and load balancing to monitor training health.

Controlled via config flags (e.g. fe_mlp_type: "dense" vs "moe") so it can be enabled per engine (ForecastingEngine, Global Assimilation Engine, TargetPredictionEngine, EnsPredictionHead).
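To make the proposal concrete, here is a rough, illustrative sketch assuming PyTorch and that the existing _DenseBlock is a Linear-activation-Linear FFN. Names such as MoEMLP, _ExpertFFN, build_mlp and the defaults for num_experts / top_k are placeholders for discussion, not the final API:

```python
# Illustrative sketch only; names, defaults and the config keys are assumptions.
import torch
import torch.nn as nn


class _ExpertFFN(nn.Module):
    """Small feed-forward expert, mirroring the shape of the existing dense block."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoEMLP(nn.Module):
    """Drop-in MLP replacement: a lightweight top-k router over several small experts."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(_ExpertFFN(dim, hidden_dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); flatten all leading dims into a token axis.
        shape = x.shape
        tokens = x.reshape(-1, shape[-1])

        logits = self.router(tokens)                     # (tokens, num_experts)
        probs = logits.softmax(dim=-1)
        top_p, top_i = probs.topk(self.top_k, dim=-1)    # routing weights and expert ids
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalise over selected experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot_idx = (top_i == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            w = top_p[token_idx, slot_idx].unsqueeze(-1)
            out[token_idx] += w * expert(tokens[token_idx])

        return out.reshape(shape)


def build_mlp(cfg: dict, dim: int, hidden_dim: int) -> nn.Module:
    """Hypothetical factory keyed on a config flag such as fe_mlp_type."""
    if cfg.get("fe_mlp_type", "dense") == "moe":
        return MoEMLP(dim, hidden_dim, cfg.get("num_experts", 4), cfg.get("top_k", 2))
    return nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
```

The loop over experts keeps the sketch readable; a production version would likely use batched dispatch, but the routing semantics would be the same.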

Expected benefits:

Encourage specialization by atmospheric regime, variable type, or coordinate context.

Potentially improve skill in extremes and regime persistence.

Describe alternatives you've considered

Scaling dense MLP hidden size further, but this increases compute and memory significantly.

Using more attention layers instead of wider MLPs, but this may not target the specific regime-dependence problem.

Additional context

A first prototype was implemented inside the ForecastingEngine with minimal changes.

Future work: extend the same approach to other engines (Global Assimilation Engine, Target Prediction Engine).

Needs monitoring of routing balance (expert usage histograms, entropy, aux loss); a small diagnostics sketch follows this list.

I want to start with small expert counts (E=4–8, top-k=2) and compare against dense baselines.
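As a starting point for that monitoring, here is a rough sketch of per-step routing diagnostics, assuming the MoE layer exposes the router's softmax probabilities and the selected top-k expert indices. The auxiliary loss uses the common fraction-of-tokens times mean-probability form (as in Switch Transformer); its weight would need tuning against the dense baseline:

```python
# Illustrative routing-health diagnostics; tensor names and shapes are assumptions.
import torch


def routing_stats(probs: torch.Tensor, top_i: torch.Tensor, num_experts: int) -> dict:
    """probs: (tokens, num_experts) router softmax; top_i: (tokens, top_k) chosen experts."""
    # Per-expert usage histogram: fraction of routed token slots sent to each expert.
    usage = torch.bincount(top_i.flatten(), minlength=num_experts).float()
    usage = usage / usage.sum().clamp(min=1)

    # Mean routing entropy: low entropy suggests the router is collapsing onto few experts.
    entropy = -(probs * probs.clamp(min=1e-9).log()).sum(dim=-1).mean()

    # Auxiliary load-balancing loss: num_experts * sum_e (token_fraction_e * mean_prob_e).
    mean_prob = probs.mean(dim=0)
    aux_loss = num_experts * (usage * mean_prob).sum()

    return {"usage": usage, "entropy": entropy, "aux_loss": aux_loss}
```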

Organisation

JSC

Metadata


Labels

initiative: Large piece of work covering multiple sprints
model: Related to model training or definition (not generic infra)
needs-design
proj:raina: Directly relevant and addressed as part of the RAINA project
