The user should be able to train models based on custom reference data. The preprocessing and feature computation approach remains the same as for the standard model; only the model itself is retrained. This functionality will be offered as a Python API, supported by Jupyter notebooks, as part of the WorldCereal Toolbox component.
Model training is also performed using openEO workflows. In principle, the full workflow could run from scratch, but in practice intermediate results need to be stored and cached. This reduces the cost of model training when multiple iterations are needed.
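To illustrate the caching idea, the sketch below keys cached results on a hash of the workflow parameters, so repeated iterations with the same parameters skip the expensive step. This is a minimal plain-Python illustration only; all names are hypothetical, and in the real system the cached artifacts would be openEO job results (such as the extraction patches described below).

```python
import hashlib
import json
import pathlib
import tempfile

def cached(cache_dir, params, compute):
    """Return the result for `params`, reusing a cached copy when one exists."""
    key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
    path = pathlib.Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = compute(params)
    path.write_text(json.dumps(result))
    return result

calls = []

def expensive_step(params):
    """Stand-in for a costly openEO extraction or feature-computation job."""
    calls.append(params)
    return {"n_samples": 1234}

with tempfile.TemporaryDirectory() as tmp:
    first = cached(tmp, {"aoi": "31UFS", "year": 2021}, expensive_step)
    # Second iteration with identical parameters is served from the cache,
    # so expensive_step runs only once.
    second = cached(tmp, {"aoi": "31UFS", "year": 2021}, expensive_step)
```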
The subsequent sections describe the various steps involved in model training.
Preprocessing features
The aim of preprocessing is to generate a 2D data structure (a table) that can be fed into CatBoost training.
Sampling point locations
The WorldCereal extractions cache consists of 64x64-pixel timeseries stored as netCDF files. Because CatBoost operates on 1D samples rather than image patches, we need to sample those patches at point locations.
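The index arithmetic behind such point sampling can be sketched as follows. This is an illustration of the concept only, not the actual sampling UDF: it maps a point coordinate to the pixel index within one 64x64 patch, given the patch's upper-left corner and resolution (all values hypothetical).

```python
def point_to_pixel(x, y, x_min, y_max, resolution):
    """Map a point (x, y) to the (row, col) pixel index of a 64x64 patch
    whose upper-left corner is (x_min, y_max) at the given resolution."""
    col = int((x - x_min) // resolution)
    row = int((y_max - y) // resolution)
    if not (0 <= row < 64 and 0 <= col < 64):
        raise ValueError("point falls outside the 64x64 patch")
    return row, col

# A 64x64 patch at 10 m resolution covers 640 m x 640 m.
row, col = point_to_pixel(x=5250.0, y=9990.0, x_min=5000.0, y_max=10000.0, resolution=10.0)
# → (1, 25)
```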
In the approach presented in the code block below, the original algorithm is translated into an openEO process graph. Other approaches are also possible, for instance first sampling the patches at point locations and then performing a stratification step on the larger dataset.
```python
import openeo
from openeo import UDF

connection = openeo.connect("openeo.dataspace.copernicus.eu")
ground_truth = connection.load_stac("https://stac_catalog.com/ground_truth")

# UDF that samples each patch at the ground-truth point locations
sampling_udf = UDF(code="", runtime="Python")

# These would be the bounding boxes of the netCDF files,
# or in fact the STAC item bboxes.
polygons = {"type": "FeatureCollection"}
ground_truth.apply_polygon(polygons, process=sampling_udf)
```
Two open points remain for this workflow:
1. Instead of aggregate_temporal, we'll do more advanced compositing, such as max-NDVI.
2. We'll need to add AgERA5 and DEM bands.
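The selection logic behind a max-NDVI composite can be sketched in plain Python: per calendar month, keep the observation with the highest NDVI. In the actual workflow this would run inside an openEO process or UDF; the band names and values below are illustrative only.

```python
from datetime import date

def max_ndvi_composite(observations):
    """Keep, per calendar month, the observation with the highest NDVI.

    `observations` is a list of (acquisition_date, ndvi, bands) tuples.
    Returns a dict mapping (year, month) to the selected band values.
    """
    best = {}
    for day, ndvi, bands in observations:
        month = (day.year, day.month)
        if month not in best or ndvi > best[month][1]:
            best[month] = (day, ndvi, bands)
    return {month: bands for month, (_, ndvi, bands) in sorted(best.items())}

obs = [
    (date(2021, 3, 4), 0.21, {"B04": 0.10, "B08": 0.15}),
    (date(2021, 3, 19), 0.55, {"B04": 0.06, "B08": 0.21}),  # highest NDVI in March
    (date(2021, 4, 2), 0.40, {"B04": 0.08, "B08": 0.19}),
]
composite = max_ndvi_composite(obs)
# → {(2021, 3): {"B04": 0.06, "B08": 0.21}, (2021, 4): {"B04": 0.08, "B08": 0.19}}
```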
Training workflow
The training workflow combines feature computation, starting from monthly timesteps, with CatBoost training.
The output is a model together with STAC metadata. The link to the STAC metadata of the model can be used by an inference workflow.
```python
from openeo import UDF

# UDF that computes Presto features based on the monthly timeseries
feature_udf = UDF(code="", runtime="Python")

features_cube = connection.load_url(
    "timesteps.parquet", format="Parquet"
).apply_dimension(dimension="t", process=feature_udf, target_dimension="bands")

ml_model = features_cube.process("fit_catboost_model", data=features_cube)
ml_model
```
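The STAC metadata produced for the trained model could look roughly like the fragment below. This is a sketch only: the field names, identifiers, and asset layout are illustrative, and the actual schema depends on the backend and on whichever STAC extension is adopted for ML models.

```json
{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "worldcereal-custom-model-2021",
  "properties": {
    "datetime": "2021-12-31T00:00:00Z",
    "description": "CatBoost crop model trained on custom reference data"
  },
  "assets": {
    "model": {
      "href": "https://example.com/models/worldcereal-custom-model.cbm",
      "type": "application/octet-stream",
      "roles": ["ml-model"]
    }
  }
}
```

An inference workflow would then only need the link to this STAC item to locate and load the model asset.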
Extracting private samples
For this use case, we assume that the user wants to use a private reference dataset. It should be available at a 'secret' URL, which can be a signed URL provided by the reference data module. openEO supports multiple input formats next to GeoParquet, but the input data needs to be harmonized.
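Harmonization essentially means renaming the user's attributes to the schema the training workflow expects and checking that required fields are present. The sketch below illustrates this for a single record; the column names and mapping are hypothetical, not the actual WorldCereal harmonization protocol.

```python
# Hypothetical mapping from a user's column names to a harmonized schema.
COLUMN_MAP = {"crop": "CT", "valid_date": "valtime", "geometry": "geometry"}
REQUIRED = {"CT", "valtime", "geometry"}

def harmonize(record, column_map=COLUMN_MAP):
    """Rename the columns of one record and verify required fields are present."""
    out = {column_map.get(k, k): v for k, v in record.items()}
    missing = REQUIRED - out.keys()
    if missing:
        raise ValueError(f"missing harmonized columns: {sorted(missing)}")
    return out

sample = {"crop": "winter_wheat", "valid_date": "2021-05-01", "geometry": "POINT (5 51)"}
harmonized = harmonize(sample)
# → {"CT": "winter_wheat", "valtime": "2021-05-01", "geometry": "POINT (5 51)"}
```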
We immediately extract a table at point locations, assuming that the cache of intermediate patches has less value for private data.
The WorldCereal preprocessing chain is assumed to be available as an openEO User Defined Process (UDP) called worldcereal_preprocessing_udp.
In this use case, the user wants to train a new model by combining multiple datasets. This should be possible by simply merging the vector cubes that go into the training process.
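In openEO terms this would be a merge of the vector cubes themselves; the plain-Python sketch below shows the table-level equivalent, concatenating per-dataset sample tables while tagging each row with its source (all names and values illustrative).

```python
def merge_training_tables(*tables):
    """Concatenate per-dataset sample tables, tagging each row with its source."""
    merged = []
    for name, rows in tables:
        for row in rows:
            merged.append({**row, "source": name})
    return merged

public_rows = [{"CT": "maize", "NDVI_2021_06": 0.7}]
private_rows = [{"CT": "wheat", "NDVI_2021_06": 0.5}]
training_table = merge_training_tables(
    ("public", public_rows),
    ("private", private_rows),
)
# → 2 rows, each carrying a "source" column
```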