Model training EO data management
To effectively train models, we need very fast access to the input data at the locations where ground truth information is available. The WorldCereal classifier primarily operates on per-pixel time series of multiple bands. The native Sentinel earth observation archives are not stored in a format that enables fast time-series access: a Sentinel-2 product, for instance, uses internal chunks of 1000x1000 pixels, while for training we only need a few 64x64 pixel chunks. Hence, a lot of unnecessary data is read from relatively slow storage when constructing such a time series.
A more favourable layout is thus to generate files that contain the full time series and all bands for a given sensor. A single read operation can then load a full year's worth of EO data for a location where reference data is available.
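As a minimal sketch of this access pattern, assuming a hypothetical extraction file sample_0001.nc with the band and dimension layout described in the cache metadata below:

```python
import xarray as xr

# A single read loads a full year of all bands for the whole 64x64 patch.
ds = xr.open_dataset("sample_0001.nc")  # hypothetical extraction file

# Per-pixel time series of the red band, for the centre pixel of the patch:
red_series = ds["B04"].isel(x=32, y=32)
print(red_series.to_series().head())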
What the training step requires is analysis-ready data (ARD), which in the case of WorldCereal currently means single-pixel, cloud-free monthly composites. As there is currently no consensus on how best to generate this analysis-ready data, the design choice was made to store the data in two formats:
- Raster files containing 64x64 pixel chunks of raw EO observations, with minimal cloud screening.
- Parquet files containing analysis-ready pixel time series, with all input bands and the ground-truth label.
Having these two formats enables experimentation both at the preprocessing level and at the level of model tuning on ARD data. If at some point the spatial context of a pixel is taken into account, that information is also available.
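For the second format, a minimal sketch of how a model-tuning step might consume the ARD Parquet files; the file name and the feature-column naming are assumptions, only the CROPTYPE label column is taken from the metadata below:

```python
import pandas as pd

df = pd.read_parquet("worldcereal_ard_samples.parquet")  # hypothetical file name

# Hypothetical layout: one row per sample, with monthly composite columns per
# band, plus the ground-truth label column CROPTYPE.
feature_cols = [c for c in df.columns if c.startswith(("B", "VV", "VH"))]
X = df[feature_cols]
y = df["CROPTYPE"]
```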
In the sections below, this general design is detailed further. We generally refer to the locally stored EO data as ‘the extractions’, because they are extracts from the main EO archive.
Extractions cache
First level cache
The first level cache is a collection of netCDF raster files, all with a fixed size (e.g. 64x64 pixels).
Extraction workflow steps
- Get the ID of the extraction to run.
- For that extraction ID, retrieve the point locations from the RDM.
- Use a UDF to convert the points into 64x64 pixel patches (see the sketch after this list).
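A hedged sketch of these steps with the openEO Python client; get_point_locations and buffer_points_to_patches are hypothetical placeholders for the RDM query and the point-to-patch conversion, and we assume the backend supports the sample_by_feature netCDF output option:

```python
import openeo

extraction_id = "example_extraction"         # hypothetical extraction ID
points = get_point_locations(extraction_id)  # placeholder: RDM query, returns GeoJSON
patches = buffer_points_to_patches(points)   # placeholder: points -> 64x64 pixel boxes

connection = openeo.connect("https://openeo.vito.be").authenticate_oidc()
cube = connection.load_collection(
    "SENTINEL2_L2A",
    temporal_extent=["2021-01-01", "2021-12-31"],
    bands=["B02", "B03", "B04", "B08", "SCL"],
).filter_spatial(patches)

# One netCDF per input geometry, so every sample gets its own time-series file.
job = cube.execute_batch(out_format="netCDF", sample_by_feature=True)
job.get_results().download_files("extractions/")
```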
First level cache requirements
- netCDF assets need to link back to the sample from which they were generated (a minimal item sketch follows this list).
- A ‘ground truth’ asset contains the raster with ground-truth information, i.e. the crop type code.
- Sentinel-2 asset at 10 m resolution.
- Sentinel-1 asset at 20 m resolution.
- AgERA5 asset.
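A minimal pystac sketch of such an item; the sample_id property and the file names are assumptions, while the asset keys match the item_assets definition below:

```python
from datetime import datetime, timezone

import pystac

# "sample_id" is an assumed property linking the item back to its RDM sample.
item = pystac.Item(
    id="sample_0001",
    geometry={"type": "Point", "coordinates": [4.09, 51.03]},
    bbox=[4.053457, 51.01616, 4.129008, 51.049831],
    datetime=datetime(2020, 5, 1, tzinfo=timezone.utc),
    properties={"sample_id": "rdm-collection-x/sample_0001"},
)
item.add_asset(
    "sentinel2",
    pystac.Asset(href="sample_0001_s2.nc",
                 media_type="application/x-netcdf", roles=["data"]),
)
item.add_asset(
    "auxiliary",
    pystac.Asset(href="sample_0001_croptype.nc",
                 media_type="application/x-netcdf", roles=["data"]),
)
```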
STAC extensions:
- projection (proj): provides detailed information on the raster size and projection system
Collection metadata
```json
{
"description": "The Level 1 input data cache contains extracted samples of EO data. It's main use is model calibration, allowing faster iterations by providing a cache.",
"extent": {
"spatial": {
"bbox": [
[
4.053457,
51.01616,
4.129008,
51.049831
]
]
},
"temporal": {
"interval": [
[
"2020-05-01T00:00:00Z",
"2020-05-22T00:00:00Z"
]
]
}
},
"id": "L1_CACHE",
"license": "CC-BY-4.0",
"links": [],
"providers": [
{
"description": "This data was processed on an openEO backend maintained by VITO.",
"name": "VITO",
"processing:facility": "openEO Geotrellis backend",
"processing:software": {
"Geotrellis backend": "0.27.0a1"
},
"roles": [
"processor"
]
}
],
"stac_extensions": [
"https://stac-extensions.github.io/eo/v1.1.0/schema.json",
"https://stac-extensions.github.io/file/v2.1.0/schema.json",
"https://stac-extensions.github.io/processing/v1.1.0/schema.json",
"https://stac-extensions.github.io/projection/v1.1.0/schema.json"
],
"stac_version": "1.0.0",
"summaries": {
"constellation": [
"sentinel-2"
],
"instruments": [
"msi"
],
"gsd": [
10,
20,
60
],
"platform": [
"sentinel-2a",
"sentinel-2b"
]
},
"title": "WorldCereal Level 1 cache",
"type": "Collection",
"cube:dimensions": {
"x": {
"type": "spatial",
"axis": "x",
"step": 10,
"reference_system": {
"$schema": "https://proj.org/schemas/v0.2/projjson.schema.json",
"area": "World",
"bbox": {
"east_longitude": 180,
"north_latitude": 90,
"south_latitude": -90,
"west_longitude": -180
},
"coordinate_system": {
"axis": [
{
"abbreviation": "Lat",
"direction": "north",
"name": "Geodetic latitude",
"unit": "degree"
},
{
"abbreviation": "Lon",
"direction": "east",
"name": "Geodetic longitude",
"unit": "degree"
}
],
"subtype": "ellipsoidal"
},
"datum": {
"ellipsoid": {
"inverse_flattening": 298.257223563,
"name": "WGS 84",
"semi_major_axis": 6378137
},
"name": "World Geodetic System 1984",
"type": "GeodeticReferenceFrame"
},
"id": {
"authority": "OGC",
"code": "Auto42001",
"version": "1.3"
},
"name": "AUTO 42001 (Universal Transverse Mercator)",
"type": "GeodeticCRS"
}
},
"y": {
"type": "spatial",
"axis": "y",
"step": 10,
"reference_system": {
"$schema": "https://proj.org/schemas/v0.2/projjson.schema.json",
"area": "World",
"bbox": {
"east_longitude": 180,
"north_latitude": 90,
"south_latitude": -90,
"west_longitude": -180
},
"coordinate_system": {
"axis": [
{
"abbreviation": "Lat",
"direction": "north",
"name": "Geodetic latitude",
"unit": "degree"
},
{
"abbreviation": "Lon",
"direction": "east",
"name": "Geodetic longitude",
"unit": "degree"
}
],
"subtype": "ellipsoidal"
},
"datum": {
"ellipsoid": {
"inverse_flattening": 298.257223563,
"name": "WGS 84",
"semi_major_axis": 6378137
},
"name": "World Geodetic System 1984",
"type": "GeodeticReferenceFrame"
},
"id": {
"authority": "OGC",
"code": "Auto42001",
"version": "1.3"
},
"name": "AUTO 42001 (Universal Transverse Mercator)",
"type": "GeodeticCRS"
}
},
"time": {
"type": "temporal",
"extent": [
"2015-06-23T00:00:00Z",
"2019-07-10T13:44:56Z"
],
"step": "P5D"
},
"spectral": {
"type": "bands",
"values": [
"SCL",
"B01",
"B02",
"B03",
"B04",
"B05",
"B06",
"B07",
"B08",
"B8A",
"B09",
"B10",
"B11",
"B12",
"CROPTYPE"
]
}
},
"item_assets": {
"sentinel2": {
"gsd": 10,
"title": "Sentinel2",
"description": "Sentinel-2 bands",
"type": "application/x-netcdf",
"roles": [
"data"
],
"proj:shape": [
64,
64
],
"raster:bands": [
{
"name": "B01"
},
{
"name": "B02"
}
],
"cube:variables": {
"B01": {"dimensions": ["time","y","x"],"type": "data"},
"B02": {"dimensions": ["time","y","x"],"type": "data"},
"B03": {"dimensions": ["time","y","x"],"type": "data"},
"B04": {"dimensions": ["time","y","x"],"type": "data"},
"B05": {"dimensions": ["time","y","x"],"type": "data"},
"B06": {"dimensions": ["time","y","x"],"type": "data"},
"B07": {"dimensions": ["time","y","x"],"type": "data"},
"B8A": {"dimensions": ["time","y","x"],"type": "data"},
"B08": {"dimensions": ["time","y","x"],"type": "data"},
"B11": {"dimensions": ["time","y","x"],"type": "data"},
"B12": {"dimensions": ["time","y","x"],"type": "data"},
"SCL": {"dimensions": ["time","y","x"],"type": "data"}
},
"eo:bands": [
{
"name": "B01",
"common_name": "coastal",
"center_wavelength": 0.443,
"full_width_half_max": 0.027
},
{
"name": "B02",
"common_name": "blue",
"center_wavelength": 0.49,
"full_width_half_max": 0.098
},
{
"name": "B03",
"common_name": "green",
"center_wavelength": 0.56,
"full_width_half_max": 0.045
},
{
"name": "B04",
"common_name": "red",
"center_wavelength": 0.665,
"full_width_half_max": 0.038
},
{
"name": "B05",
"common_name": "rededge",
"center_wavelength": 0.704,
"full_width_half_max": 0.019
},
{
"name": "B06",
"common_name": "rededge",
"center_wavelength": 0.74,
"full_width_half_max": 0.018
},
{
"name": "B07",
"common_name": "rededge",
"center_wavelength": 0.783,
"full_width_half_max": 0.028
},
{
"name": "B08",
"common_name": "nir",
"center_wavelength": 0.842,
"full_width_half_max": 0.145
},
{
"name": "B8A",
"common_name": "nir08",
"center_wavelength": 0.865,
"full_width_half_max": 0.033
},
{
"name": "B11",
"common_name": "swir16",
"center_wavelength": 1.61,
"full_width_half_max": 0.143
},
{
"name": "B12",
"common_name": "swir22",
"center_wavelength": 2.19,
"full_width_half_max": 0.242
}
]
},
"auxiliary": {
"title": "ground truth data",
"description": "This asset contains the crop type codes.",
"type": "application/x-netcdf",
"roles": [
"data"
],
"proj:shape": [
64,
64
],
"raster:bands": [
{
"name": "CROPTYPE",
"data_type": "uint16",
"bits_per_sample": 16
}
]
},
"sentinel1": {},
"agera5": {}
}
}
```
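As a usage sketch, the cache can then be queried per sample location with pystac-client; the catalog URL is a placeholder, while the collection ID and asset key come from the metadata above:

```python
from pystac_client import Client

catalog = Client.open("https://stac.example.org")  # placeholder catalog URL
search = catalog.search(
    collections=["L1_CACHE"],
    bbox=[4.05, 51.01, 4.13, 51.05],
    datetime="2020-05-01/2020-05-22",
)
for item in search.items():
    s2 = item.assets["sentinel2"]  # netCDF with the full Sentinel-2 time series
    print(item.id, s2.href)
```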
Cache updates
The RDM needs to be queried on a regular basis to discover new collections. Whenever a new collection becomes available in the RDM, we want the extraction workflow to automatically update the cache, allowing users to efficiently train models on all data available in the RDM.
Option 1: NiFi
- HTTP fetch of collections -> DetectDuplicate / DeduplicateRecord for fast duplicate dropping -> LookupRecord to check whether we already know about the collection (via an SQL query).
- New collections become flow files.
- Per new collection, perform job splitting.
- Continuously run the job manager on the job splits.
- Monitoring: NiFi processors to send mail.
Option 2: Kubernetes cron job
Kubernetes can schedule a cron job, allowing a Python script to run on a daily basis to detect new collections.
Monitoring: Alertmanager.
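A minimal sketch of such a daily detection script; the RDM endpoint, the state file, and trigger_extraction are hypothetical placeholders:

```python
import json
import urllib.request

RDM_URL = "https://rdm.example.org/collections"  # placeholder RDM endpoint
STATE_FILE = "/state/known_collections.json"     # persisted between cron runs

def load_known() -> set:
    try:
        with open(STATE_FILE) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()

def main() -> None:
    known = load_known()
    with urllib.request.urlopen(RDM_URL) as resp:
        collections = json.load(resp)
    new_ids = [c["id"] for c in collections if c["id"] not in known]
    for collection_id in new_ids:
        # Placeholder: kick off the extraction workflow for this collection.
        trigger_extraction(collection_id)
    with open(STATE_FILE, "w") as f:
        json.dump(sorted(known | set(new_ids)), f)

if __name__ == "__main__":
    main()
```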
Open questions: should the job manager write Parquet to S3? Should GFMap get a feature to write to a workspace, or a user workspace with HTTP access? Or should this run as an ‘upscaling service’ pod in k8s?
Dashboard: