The user should be able to train models based on custom reference data. The preprocessing and feature computation approach remains the same as for the standard model; only the model itself is retrained. This functionality will be offered as a Python API, supported by Jupyter notebooks, as part of the WorldCereal Toolbox component.
Model training is also performed using openEO workflows. In principle, the full workflow could run from scratch, but in practice intermediate results need to be stored and cached. This reduces the cost of model training when multiple iterations are needed.
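To illustrate the caching idea, the sketch below keys cached results on a hash of the workflow parameters, so repeated iterations with the same parameters skip the expensive step. This is a minimal plain-Python illustration only; all names are hypothetical, and in the real system the cached artifacts would be openEO job results (such as the extraction patches described below).

```python
import hashlib
import json
import pathlib
import tempfile

def cached(cache_dir, params, compute):
    """Return the result for `params`, reusing a cached copy when one exists."""
    key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
    path = pathlib.Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = compute(params)
    path.write_text(json.dumps(result))
    return result

calls = []

def expensive_step(params):
    """Stand-in for a costly openEO extraction or feature-computation job."""
    calls.append(params)
    return {"n_samples": 1234}

with tempfile.TemporaryDirectory() as tmp:
    first = cached(tmp, {"aoi": "31UFS", "year": 2021}, expensive_step)
    # Second iteration with identical parameters is served from the cache,
    # so expensive_step runs only once.
    second = cached(tmp, {"aoi": "31UFS", "year": 2021}, expensive_step)
```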
The subsequent sections describe the various steps involved in model training.
Preprocessing features
The aim of preprocessing is to generate a 2D data structure (a table) that can be fed into CatBoost training.
Sampling point locations
The WorldCereal extractions cache consists of 64x64-pixel timeseries stored as netCDF files. Because CatBoost operates on 1D samples rather than image patches, we need to sample those patches at point locations.
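The index arithmetic behind such point sampling can be sketched as follows. This is an illustration of the concept only, not the actual sampling UDF: it maps a point coordinate to the pixel index within one 64x64 patch, given the patch's upper-left corner and resolution (all values hypothetical).

```python
def point_to_pixel(x, y, x_min, y_max, resolution):
    """Map a point (x, y) to the (row, col) pixel index of a 64x64 patch
    whose upper-left corner is (x_min, y_max) at the given resolution."""
    col = int((x - x_min) // resolution)
    row = int((y_max - y) // resolution)
    if not (0 <= row < 64 and 0 <= col < 64):
        raise ValueError("point falls outside the 64x64 patch")
    return row, col

# A 64x64 patch at 10 m resolution covers 640 m x 640 m.
row, col = point_to_pixel(x=5250.0, y=9990.0, x_min=5000.0, y_max=10000.0, resolution=10.0)
# → (1, 25)
```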
In the approach presented in the code block below, the original algorithm is translated into an openEO process graph. Other approaches are also possible, for instance first sampling the patches at point locations and then performing a stratification step on the larger dataset.
```python
import openeo
from openeo import UDF

connection = openeo.connect("openeo.dataspace.copernicus.eu")
ground_truth = connection.load_stac("https://stac_catalog.com/ground_truth")

# UDF that samples each patch at the ground-truth point locations
sampling_udf = UDF(code="", runtime="Python")

# These would be the bounding boxes of the netCDF files,
# or in fact the STAC item bboxes.
polygons = {"type": "FeatureCollection"}
ground_truth.apply_polygon(polygons, process=sampling_udf)
```
Two open points remain for this workflow:
1. Instead of aggregate_temporal, we'll do more advanced compositing, such as max-NDVI.
2. We'll need to add AgERA5 and DEM bands.
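The selection logic behind a max-NDVI composite can be sketched in plain Python: per calendar month, keep the observation with the highest NDVI. In the actual workflow this would run inside an openEO process or UDF; the band names and values below are illustrative only.

```python
from datetime import date

def max_ndvi_composite(observations):
    """Keep, per calendar month, the observation with the highest NDVI.

    `observations` is a list of (acquisition_date, ndvi, bands) tuples.
    Returns a dict mapping (year, month) to the selected band values.
    """
    best = {}
    for day, ndvi, bands in observations:
        month = (day.year, day.month)
        if month not in best or ndvi > best[month][1]:
            best[month] = (day, ndvi, bands)
    return {month: bands for month, (_, ndvi, bands) in sorted(best.items())}

obs = [
    (date(2021, 3, 4), 0.21, {"B04": 0.10, "B08": 0.15}),
    (date(2021, 3, 19), 0.55, {"B04": 0.06, "B08": 0.21}),  # highest NDVI in March
    (date(2021, 4, 2), 0.40, {"B04": 0.08, "B08": 0.19}),
]
composite = max_ndvi_composite(obs)
# → {(2021, 3): {"B04": 0.06, "B08": 0.21}, (2021, 4): {"B04": 0.08, "B08": 0.19}}
```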
Training workflow
The training workflow combines feature computation, starting from monthly timesteps, with CatBoost training.
The output is a model together with STAC metadata. The link to the STAC metadata of the model can be used by an inference workflow.
```python
from openeo import UDF

# UDF that computes Presto features based on the monthly timeseries
feature_udf = UDF(code="", runtime="Python")

features_cube = connection.load_url(
    "timesteps.parquet", format="Parquet"
).apply_dimension(dimension="t", process=feature_udf, target_dimension="bands")

ml_model = features_cube.process("fit_catboost_model", data=features_cube)
ml_model
```
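The STAC metadata produced for the trained model could look roughly like the fragment below. This is a sketch only: the field names, identifiers, and asset layout are illustrative, and the actual schema depends on the backend and on whichever STAC extension is adopted for ML models.

```json
{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "worldcereal-custom-model-2021",
  "properties": {
    "datetime": "2021-12-31T00:00:00Z",
    "description": "CatBoost crop model trained on custom reference data"
  },
  "assets": {
    "model": {
      "href": "https://example.com/models/worldcereal-custom-model.cbm",
      "type": "application/octet-stream",
      "roles": ["ml-model"]
    }
  }
}
```

An inference workflow would then only need the link to this STAC item to locate and load the model asset.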
Extracting private samples
For this use case, we assume that the user wants to use a private reference dataset. It should be available at a 'secret' URL, which can be a signed URL provided by the reference data module. openEO supports multiple input formats next to GeoParquet, but the input data needs to be harmonized.
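Harmonization essentially means renaming the user's attributes to the schema the training workflow expects and checking that required fields are present. The sketch below illustrates this for a single record; the column names and mapping are hypothetical, not the actual WorldCereal harmonization protocol.

```python
# Hypothetical mapping from a user's column names to a harmonized schema.
COLUMN_MAP = {"crop": "CT", "valid_date": "valtime", "geometry": "geometry"}
REQUIRED = {"CT", "valtime", "geometry"}

def harmonize(record, column_map=COLUMN_MAP):
    """Rename the columns of one record and verify required fields are present."""
    out = {column_map.get(k, k): v for k, v in record.items()}
    missing = REQUIRED - out.keys()
    if missing:
        raise ValueError(f"missing harmonized columns: {sorted(missing)}")
    return out

sample = {"crop": "winter_wheat", "valid_date": "2021-05-01", "geometry": "POINT (5 51)"}
harmonized = harmonize(sample)
# → {"CT": "winter_wheat", "valtime": "2021-05-01", "geometry": "POINT (5 51)"}
```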
We immediately extract a table at point locations, assuming that the cache of intermediate patches has less value for private data.
The WorldCereal preprocessing chain is assumed to be available as an openEO User Defined Process (UDP) called worldcereal_preprocessing_udp.
In this use case, the user wants to train a new model by combining multiple datasets. This should be possible by simply merging the vector cubes that go into the training process.
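In openEO terms this would be a merge of the vector cubes themselves; the plain-Python sketch below shows the table-level equivalent, concatenating per-dataset sample tables while tagging each row with its source (all names and values illustrative).

```python
def merge_training_tables(*tables):
    """Concatenate per-dataset sample tables, tagging each row with its source."""
    merged = []
    for name, rows in tables:
        for row in rows:
            merged.append({**row, "source": name})
    return merged

public_rows = [{"CT": "maize", "NDVI_2021_06": 0.7}]
private_rows = [{"CT": "wheat", "NDVI_2021_06": 0.5}]
training_table = merge_training_tables(
    ("public", public_rows),
    ("private", private_rows),
)
# → 2 rows, each carrying a "source" column
```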