Description
Is your feature request related to a problem? Please describe.
I would like to explore whether Mixture-of-Experts (MoE) layers can improve the performance of the WeatherGenerator model when predicting precipitation from IMERG and ERA5.
Currently, the model uses standard dense MLP blocks inside the ForecastingEngine and other engines. These dense MLPs must fit all dynamical regimes with a single set of parameters, which may limit their ability to capture diverse atmospheric behaviors (e.g., blocking vs zonal flow, tropical convection vs extratropics).
The idea is to replace the final MLP block in the forecasting engine (and possibly other engines later) with an MoE-style layer. The hope is that this improves performance while requiring only minimal code changes, so it stays easy to experiment with.
Describe the solution you'd like
Introduce an MoE MLP block as a drop-in replacement for the current MLP.
It should preserve the same interface (forward(*args)) so it can be used in existing engines with minimal changes.
The MoE version will use a lightweight top-k router and multiple experts (num_experts, top_k configurable).
Each expert is a small FFN (mirroring the existing _DenseBlock), and the router determines which experts contribute per token.
Add optional tracking for per-expert usage, entropy, and load balancing to monitor training health.
Controlled via config flags (e.g. fe_mlp_type: "dense" vs "moe") so it can be enabled per engine (ForecastingEngine, Global Assimilation Engine, TargetPredictionEngine, EnsPredictionHead). A rough sketch of such a block is shown below.
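As a starting point, here is a minimal sketch of what such a block could look like in PyTorch. All names here (MoEMLP, _ExpertFFN, num_experts, top_k, hidden_factor, track_stats) and the exact expert layout are illustrative assumptions rather than existing WeatherGenerator interfaces; only the drop-in forward signature, the lightweight top-k router, and the optional usage/entropy/load-balance tracking come from the proposal above.

```python
# Illustrative sketch only: names and layout are assumptions, not the
# actual WeatherGenerator code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class _ExpertFFN(nn.Module):
    """Small feed-forward expert, mirroring a standard two-layer MLP block."""

    def __init__(self, dim: int, hidden_factor: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_factor * dim),
            nn.GELU(),
            nn.Linear(hidden_factor * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoEMLP(nn.Module):
    """Top-k routed mixture-of-experts MLP with the same forward(x) shape
    behaviour as a dense MLP block, so it can be swapped in with minimal changes."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2,
                 hidden_factor: int = 4, track_stats: bool = True):
        super().__init__()
        assert top_k <= num_experts
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)  # lightweight router
        self.experts = nn.ModuleList(
            _ExpertFFN(dim, hidden_factor) for _ in range(num_experts)
        )
        self.track_stats = track_stats
        self.last_stats: dict = {}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); flatten all leading dimensions into a token dimension.
        orig_shape = x.shape
        tokens = x.reshape(-1, orig_shape[-1])                  # (T, dim)

        logits = self.router(tokens)                            # (T, E)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # (T, k)
        # Renormalise so the selected experts' weights sum to one per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(tokens[token_idx])
            out.index_add_(0, token_idx,
                           expert_out * topk_probs[token_idx, slot].unsqueeze(-1))

        if self.track_stats:
            with torch.no_grad():
                # Fraction of routed slots assigned to each expert (usage histogram).
                usage = torch.bincount(topk_idx.reshape(-1),
                                       minlength=self.num_experts).float()
                usage = usage / usage.sum().clamp(min=1.0)
                # Router entropy; collapses towards 0 if the router degenerates.
                entropy = -(probs * probs.clamp(min=1e-9).log()).sum(-1).mean()
            # Switch-Transformer-style load-balancing term; the training loop can
            # add it to the loss with a small weight.
            frac_top1 = F.one_hot(topk_idx[:, 0], self.num_experts).float().mean(0)
            aux_loss = self.num_experts * (frac_top1 * probs.mean(0)).sum()
            self.last_stats = {"usage": usage, "router_entropy": entropy,
                               "load_balance_loss": aux_loss}

        return out.reshape(orig_shape)
```

If the existing engines call the block as forward(*args), a thin adapter could pass the extra positional arguments through unchanged, and the choice between the dense block and this one could then hang off the proposed fe_mlp_type flag.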
Expected benefits:
Encourage specialization by atmospheric regime, variable type, or coordinate context.
Potentially improve skill in extremes and regime persistence.
Describe alternatives you've considered
Scaling the dense MLP hidden size further, but this increases compute and memory significantly.
Using more attention layers instead of wider MLPs, but this may not target the specific regime-dependence problem.
Additional context
A first prototype was implemented inside the ForecastingEngine with minimal changes.
Future work: extend the same approach to other engines (Global Assimilation Engine, Target Prediction Engine).
Needs monitoring of routing balance (expert usage histograms, entropy, aux loss).
I want to start with small expert counts (E=4–8, top-k=2) and compare against dense baselines; a rough sketch of such a comparison is below.
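A hedged sketch of how the dense-vs-MoE comparison and the routing-health monitoring could be exercised, assuming the MoEMLP class from the sketch above; the dense baseline, tensor shapes, and loss weighting are placeholders, not the actual WeatherGenerator training setup.

```python
# Illustrative experiment setup (hypothetical; not the actual WeatherGenerator
# training loop). Compares a small MoE block against a dense baseline and logs
# routing statistics for monitoring.
import torch

dim = 256
x = torch.randn(8, 1024, dim)  # (batch, tokens, dim), dummy activations

dense = torch.nn.Sequential(
    torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim)
)
moe = MoEMLP(dim, num_experts=4, top_k=2)  # small expert count as proposed (E=4-8)

y_dense, y_moe = dense(x), moe(x)
assert y_dense.shape == y_moe.shape == x.shape  # drop-in shape compatibility

stats = moe.last_stats
print("expert usage:", stats["usage"].tolist())            # usage histogram over experts
print("router entropy:", float(stats["router_entropy"]))   # should stay well above 0
print("load-balance loss:", float(stats["load_balance_loss"]))

# In training, the auxiliary term would typically be added with a small weight, e.g.:
# loss = task_loss + 0.01 * stats["load_balance_loss"]
```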
Organisation
JSC