This report presents our first campaign of experiments aimed at building a weather forecasting model with deep learning.

Our first goal was to train several neural network architectures and compare them, all else being equal.

Setup

Dataset: TITAN

  • Source: AROME analyses only (no data from the ARPEGE coupling model)
  • Resolution: 2.5km
  • Historical Data: 2021-2023 (training: 2021-2022, testing: 2023)
  • Time Step: 1 Hour
  • 21 Weather Parameters: Model inputs and outputs
    • 5 Surface Variables: Temperature, humidity, wind (u & v), and precipitation
    • 4 Variables at 4 Vertical Levels:
      • 850, 700, 500, 250 hPa
      • T, U, V, Z
  • 4 Forcing Fields: Model inputs (see the sketch after this list)
    • Cosine and sine of the time of day
    • Cosine and sine of the day of the year
  • Training Samples: ~16,000 pairs (t0, t+1)
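
The forcing fields encode time cyclically, so that the model sees 23h and midnight, or 31 December and 1 January, as neighboring instants. As a minimal sketch (in Python; the function name and return convention are assumptions for illustration, not the actual py4cast code):

  import math
  from datetime import datetime

  def forcing_fields(t: datetime):
      """Cyclic encodings of the time of day and the day of the year (hypothetical helper)."""
      hour_frac = (t.hour + t.minute / 60) / 24       # fraction of the day, in [0, 1)
      day_frac = (t.timetuple().tm_yday - 1) / 366    # fraction of the year, in [0, 1)
      return (
          math.cos(2 * math.pi * hour_frac),
          math.sin(2 * math.pi * hour_frac),
          math.cos(2 * math.pi * day_frac),
          math.sin(2 * math.pi * day_frac),
      )

These four scalars are typically broadcast over the 2.5km grid and concatenated to the 21 weather parameters as extra input channels.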

Note: Precipitation is the only parameter that is not an analysis: it is an AROME forecast issued every hour, giving the cumulative precipitation in mm over the next hour. In the future, we aim to use higher-quality, expert-validated radar data.

Training Methodology

  • Training strategy: Trained to make 1-hour forecasts by predicting the standardized one-step state difference: y_pred = x + f(x) * step_diff_std + step_diff_mean, where step_diff_mean and step_diff_std are the per-parameter mean and standard deviation of the 1-hour state differences (see the sketch after this list)
  • Cost Function: Weighted Mean Squared Error (MSE)
  • Models: UNetR++, SwinUNetR, HiLAM
  • Learning Rate Scheduler: TODO
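
As a minimal sketch of the scaled-step strategy and the weighted MSE above (PyTorch; the tensor shapes and names such as training_step are assumptions for illustration, not the actual py4cast code):

  import torch

  def training_step(model, x, y_true, step_diff_mean, step_diff_std, weights):
      """One training step with the scaled-residual formulation.

      x, y_true: states at t0 and t0+1h, shape (batch, channels, H, W)
      step_diff_mean, step_diff_std, weights: per-channel, shape (channels, 1, 1)
      """
      # The network predicts the standardized 1-hour state difference;
      # unscale it and add it to the current state to get the forecast.
      y_pred = x + model(x) * step_diff_std + step_diff_mean
      # Weighted MSE: each parameter contributes according to its weight.
      return (weights * (y_pred - y_true) ** 2).mean()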

Hardware

  • Training: Conducted on one node with 4 NVIDIA V100 32GB GPUs
  • Training Duration: 2 to 15 days, depending on the model
  • Inference Time: Less than one minute on CPU for a +12h forecast, produced autoregressively (see the sketch after this list)
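
Since the models are trained on single 1-hour steps, longer lead times are obtained by feeding each forecast back in as the next input. A minimal sketch of such a +12h rollout (hypothetical names; forcing-field updates omitted for brevity):

  import torch

  @torch.no_grad()
  def rollout(model, x0, step_diff_mean, step_diff_std, n_steps=12):
      """Apply the 1-hour model autoregressively for n_steps hours."""
      x, states = x0, []
      for _ in range(n_steps):
          # Same scaled step as in training; in practice the time-dependent
          # forcing fields would be recomputed at each step.
          x = x + model(x) * step_diff_std + step_diff_mean
          states.append(x)
      return torch.stack(states)  # (n_steps, batch, channels, H, W)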

Results

Both on metrics and in forecast case studies, the modified and optimized UNetR++ model proved to be the best of the tested models.

The HiLAM model, on the other hand, was costly to train and yielded lower scores and smoother, lower-resolution forecasts.

Table: training time, comments (config + link to config file), forecast smoothness, final loss on the test set, RAM used, batch size

Model     | Configuration            | Training Time | Batch Size | RAM Used | Final loss on test set | Comments
HiLAM     | details + config file    | 5 days        |            |          |                        | Ground cumulated rainfall on next 1h
HiLAM 128 | details + config file    | 8 days        |            |          |                        | 10m U, V
SwinUNetR | ARPEGE EURAT01 (0.1°)    |               |            |          |                        | 2m T, HU
UNetR++   | ARPEGE EURAT01 (0.1°)    |               |            |          |                        | 24 isobaric levels: Z, T, U, V, HU
UNetR++   | ARPEGE EURAT01 (0.1°)    |               |            |          |                        | Sea P
UNetR++   | AROME EURW1S100 (1.3 km) |               |            |          |                        | 10m U, V

Scores

TODO: add a graph of RMSE per lead time for a few surface and upper-air parameters

Animations

Here we present some forecast animations from these two models, compared to AROME forecasts and to the AROME analyses (our “ground truth” here), on a case study covering 5 surface parameters.

On 18 June 2023 at 12:00 UTC, a warm and unstable southwesterly flow favored the development of strong thunderstorms over parts of France. The north of the country was swept by wind gusts of 90 to 110 km/h. Supercells formed over the center and then the southwest of the country, at times producing heavy hail.

INSERT GIFS

Analysis & Perspectives

  • Initial experiments show that it is possible to train a neural network model to provide forecasts at the scale of France.
  • The obtained models provide consistent forecasts for the studied cases but become unrealistic beyond a 12-hour lead time.
  • These initial experiments highlight the importance of not confining ourselves to assumptions inherited from physical models, and of exploring all the possibilities offered by neural networks. We see here that interesting forecasts can already be obtained with a model trained on few parameters and vertical levels, and on little historical data. Similarly, Graph Neural Networks, which were thought to be well suited to weather forecasting, are costly and (in our case) less effective than Vision Transformers.
  • The py4cast development framework proves useful for quickly conducting experiment campaigns and testing new ideas.
  • Future priorities include:
    • Testing other architectures: GraphCast, Pangu, others (in collaboration with Eviden)
    • Expanding the Titan dataset: temporal depth, other parameters…
    • Testing the influence of the number of time steps in input and output, adding a large-scale forcing model in input, testing variability between two trainings…
    • Testing the influence of the time-step duration (as in Pangu-Weather) on forecast quality at different lead times (24h, 48h, …)
    • Conducting optimization work on models and training strategies to remain as resource-efficient as possible