Description
Is your feature request related to a problem? Please describe.
I would like to explore whether Mixture-of-Experts (MoE) layers can improve the performance of the WeatherGenerator model when predicting precipitation from IMERG and ERA5.
Currently, the model uses standard dense MLP blocks inside the ForecastingEngine and other engines. These dense MLPs must fit all dynamical regimes with a single set of parameters, which may limit their ability to capture diverse atmospheric behaviors (e.g., blocking vs zonal flow, tropical convection vs extratropics).
The idea is to replace the final MLP block in the forecasting engine (and possibly other engines later) with an MoE-style layer. The hope is that this improves performance while requiring only minimal code changes, so it stays easy to experiment with.
Describe the solution you'd like
Introduce an MoE MLP block as a drop-in replacement for the current MLP.
It should preserve the same interface (forward(*args)) so it can be used in existing engines with minimal changes.
The MoE version will use a lightweight top-k router and multiple experts (num_experts, top_k configurable).
Each expert is a small FFN (mirroring the existing _DenseBlock), and the router determines which experts contribute per token.
Add optional tracking for per-expert usage, entropy, and load balancing to monitor training health.
Controlled via config flags (e.g. fe_mlp_type: "dense" vs "moe") so it can be enabled per engine (ForecastingEngine, Global Assimilation Engine, TargetPredictionEngine, EnsPredictionHead). A rough sketch of such a block is shown below.
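As a starting point, here is a minimal sketch of what such a block could look like in PyTorch. All names here (MoEMLP, _ExpertFFN, num_experts, top_k, hidden_factor, track_stats) and the exact expert layout are illustrative assumptions rather than existing WeatherGenerator interfaces; only the drop-in forward signature, the lightweight top-k router, and the optional usage/entropy/load-balance tracking come from the proposal above.

```python
# Illustrative sketch only: names and layout are assumptions, not the
# actual WeatherGenerator code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class _ExpertFFN(nn.Module):
    """Small feed-forward expert, mirroring a standard two-layer MLP block."""

    def __init__(self, dim: int, hidden_factor: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_factor * dim),
            nn.GELU(),
            nn.Linear(hidden_factor * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoEMLP(nn.Module):
    """Top-k routed mixture-of-experts MLP with the same forward(x) shape
    behaviour as a dense MLP block, so it can be swapped in with minimal changes."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2,
                 hidden_factor: int = 4, track_stats: bool = True):
        super().__init__()
        assert top_k <= num_experts
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)  # lightweight router
        self.experts = nn.ModuleList(
            _ExpertFFN(dim, hidden_factor) for _ in range(num_experts)
        )
        self.track_stats = track_stats
        self.last_stats: dict = {}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); flatten all leading dimensions into a token dimension.
        orig_shape = x.shape
        tokens = x.reshape(-1, orig_shape[-1])                  # (T, dim)

        logits = self.router(tokens)                            # (T, E)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # (T, k)
        # Renormalise so the selected experts' weights sum to one per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(tokens[token_idx])
            out.index_add_(0, token_idx,
                           expert_out * topk_probs[token_idx, slot].unsqueeze(-1))

        if self.track_stats:
            with torch.no_grad():
                # Fraction of routed slots assigned to each expert (usage histogram).
                usage = torch.bincount(topk_idx.reshape(-1),
                                       minlength=self.num_experts).float()
                usage = usage / usage.sum().clamp(min=1.0)
                # Router entropy; collapses towards 0 if the router degenerates.
                entropy = -(probs * probs.clamp(min=1e-9).log()).sum(-1).mean()
            # Switch-Transformer-style load-balancing term; the training loop can
            # add it to the loss with a small weight.
            frac_top1 = F.one_hot(topk_idx[:, 0], self.num_experts).float().mean(0)
            aux_loss = self.num_experts * (frac_top1 * probs.mean(0)).sum()
            self.last_stats = {"usage": usage, "router_entropy": entropy,
                               "load_balance_loss": aux_loss}

        return out.reshape(orig_shape)
```

If the existing engines call the block as forward(*args), a thin adapter could pass the extra positional arguments through unchanged, and the choice between the dense block and this one could then hang off the proposed fe_mlp_type flag.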
Expected benefits:
Encourage specialization by atmospheric regime, variable type, or coordinate context.
Potentially improve skill in extremes and regime persistence.
Describe alternatives you've considered
Scaling the dense MLP hidden size further, but this increases compute and memory significantly.
Using more attention layers instead of wider MLPs, but this may not target the specific regime-dependence problem.
Additional context
A first prototype was implemented inside the ForecastingEngine with minimal changes.
Future work: extend the same approach to other engines (Global Assimilation Engine, Target Prediction Engine).
Needs monitoring of routing balance (expert usage histograms, entropy, aux loss).
I want to start with small expert counts (E=4–8, top-k=2) and compare against dense baselines; a rough sketch of such a comparison is below.
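A hedged sketch of how the dense-vs-MoE comparison and the routing-health monitoring could be exercised, assuming the MoEMLP class from the sketch above; the dense baseline, tensor shapes, and loss weighting are placeholders, not the actual WeatherGenerator training setup.

```python
# Illustrative experiment setup (hypothetical; not the actual WeatherGenerator
# training loop). Compares a small MoE block against a dense baseline and logs
# routing statistics for monitoring.
import torch

dim = 256
x = torch.randn(8, 1024, dim)  # (batch, tokens, dim), dummy activations

dense = torch.nn.Sequential(
    torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim)
)
moe = MoEMLP(dim, num_experts=4, top_k=2)  # small expert count as proposed (E=4-8)

y_dense, y_moe = dense(x), moe(x)
assert y_dense.shape == y_moe.shape == x.shape  # drop-in shape compatibility

stats = moe.last_stats
print("expert usage:", stats["usage"].tolist())            # usage histogram over experts
print("router entropy:", float(stats["router_entropy"]))   # should stay well above 0
print("load-balance loss:", float(stats["load_balance_loss"]))

# In training, the auxiliary term would typically be added with a small weight, e.g.:
# loss = task_loss + 0.01 * stats["load_balance_loss"]
```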
Organisation
JSC