```python
import pandas as pd

pd.read_csv("sampling_table.csv")
```
| | sample_id | stratification_flag | stratification_timestamp | extraction_id |
|---|---|---|---|---|
| 0 | my_sample | 1 | 2024-03-01T02:00:00Z | the_id |
While the RDM is built on top of a PostGIS database, we are increasingly using H3 for fast geospatial lookups. H3 is a discrete global grid system with properties that suit cases like ours well.
One example use case is creating a heatmap showing the distribution of samples for specific crop types across the world.
H3 indexes can also be computed on the fly from the geometry, but many operations are faster when performed on 64-bit integers rather than on geometries.
This can be achieved in different technology stacks. For instance, in [Postgres](https://github.com/zachasme/h3-pg):

```sql
SELECT h3_lat_lng_to_cell(POINT('37.3615593,-122.0553238'), 5);
```

or in [Python](https://uber.github.io/h3-py/intro.html#usage):

```python
h3.latlng_to_cell(lat, lng, resolution)
```
Other bindings are listed at https://h3geo.org/docs/community/bindings.
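The 64-bit structure can also be exploited directly: H3 stores the cell resolution in bits 52–55 of the index, so once a hexadecimal index string is converted to an integer, the resolution can be read without any library call. A stdlib-only sketch, using a resolution-5 example cell from the H3 documentation:

```python
# Example resolution-5 H3 cell taken from the H3 documentation.
cell = "85283473fffffff"

# H3 indexes are 64-bit values; converting the hex string to an integer
# makes joins and sorts cheaper than operating on strings or geometries.
cell_int = int(cell, 16)

# The resolution is stored in bits 52-55 of the index.
resolution = (cell_int >> 52) & 0xF
print(cell_int, resolution)  # resolution is 5 for this cell
```

This is only an illustration of the index layout; in practice the H3 bindings provide dedicated conversion and inspection functions.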
The total number of samples in the RDM is very large, and the samples are unevenly spread both geographically and in terms of crop type distribution. Hence the stratification strategy is an important factor that affects not only the quality of the model, but also the efficiency with which we can train models.
Therefore, the RDM needs to be able to run a given stratification approach to select a subset of the samples. It should also be possible to consider new data in the stratification.
We propose the following database design:
For the stratification algorithm itself, we would like to use a discrete global grid, specifically H3. In certain cases, this grid can also replace geospatial queries. Hence, for each sample, we would like to add the H3 index at level 5.
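To make the idea concrete, the sketch below caps the number of samples kept per (H3 level-5 cell, crop type) stratum in pandas. The column names, the capping rule, and the shuffle-then-head approach are assumptions for illustration, not the final RDM design:

```python
import pandas as pd

def stratify(df: pd.DataFrame, max_per_stratum: int = 2, seed: int = 42) -> pd.DataFrame:
    # Shuffle once, then keep at most `max_per_stratum` rows per
    # (H3 level-5 cell, crop type) stratum.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(["h3_l5", "crop_type"], sort=False).head(max_per_stratum)

# Toy samples table; cell values are illustrative strings only.
samples = pd.DataFrame({
    "sample_id": range(6),
    "h3_l5": ["85283473fffffff"] * 4 + ["85283447fffffff"] * 2,
    "crop_type": [1200, 1200, 1200, 1700, 1200, 1200],
})
selected = stratify(samples)
print(len(selected))  # 2 + 1 + 2 = 5 samples survive the cap
```

A real implementation would likely weight strata by target class balance rather than apply a flat cap, but the grouping key (H3 cell × crop type) stays the same.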
Selected samples are exported to GeoParquet for easy access via a public URL.
An example table is shown below.
Based on this table, the following queries can be defined. These are cases that we would need in practice.
These calls can be done through an OGC API Features request or via GeoParquet. The investigation below explores which approach works best.
For private reference data, we assume that we do not want to do stratification, but rather use all samples for extra training. This means that if the data is made available via a signed (secret) URL, the processing module can retrieve the locations and extract points.
For the generation of statistics, such as counts per crop type in H3 grid cells, the OGC Features API does not seem to include support. Hence we would need a background task that updates these statistics, either on a fixed time schedule or triggered by new ingestions into the RDM.
If this background task could directly generate a GeoParquet file, it may also be possible to avoid a more advanced setup (based on a database plus web service).
```python
%%time
import fsspec
import geopandas as gpd

pq_path = "https://ewocstorage.blob.core.windows.net/collections/2021_PT_EUROCROP_POLY_110.parquet"
with fsspec.open(pq_path) as file:
    df = gpd.read_parquet(file, columns=["geometry", "CT"])
df
```
CPU times: user 909 ms, sys: 305 ms, total: 1.21 s
Wall time: 17.3 s
| | geometry | CT |
|---|---|---|
| 0 | MULTIPOLYGON (((-8.54796 40.56554, -8.54942 40... | 1200 |
| 1 | MULTIPOLYGON (((-8.52352 40.55686, -8.52352 40... | 1700 |
| 2 | MULTIPOLYGON (((-8.52456 40.55538, -8.52454 40... | 3300 |
| 3 | MULTIPOLYGON (((-8.52835 40.56835, -8.52837 40... | 2000 |
| 4 | MULTIPOLYGON (((-8.52781 40.57128, -8.52814 40... | 0 |
| ... | ... | ... |
| 99995 | MULTIPOLYGON (((-6.46866 41.43669, -6.46832 41... | 0 |
| 99996 | MULTIPOLYGON (((-6.46797 41.43497, -6.46778 41... | 0 |
| 99997 | MULTIPOLYGON (((-7.45134 41.73074, -7.45134 41... | 1200 |
| 99998 | MULTIPOLYGON (((-6.47482 41.44217, -6.47480 41... | 0 |
| 99999 | MULTIPOLYGON (((-6.47503 41.44182, -6.47504 41... | 9520 |

100000 rows × 2 columns
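The per-crop-type counts shown next were presumably obtained with a simple groupby on the loaded frame, i.e. something along the lines of `df.groupby("CT").count()`. A toy reproduction of the pattern:

```python
import pandas as pd

# Toy frame mimicking the (geometry, CT) columns loaded above.
toy = pd.DataFrame({"geometry": ["g1", "g2", "g3"], "CT": [0, 0, 1200]})

# Counting rows per crop type is a single groupby.
counts = toy.groupby("CT").count()
print(counts)
```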
| CT | geometry |
|---|---|
| 0 | 37673 |
| 1100 | 191 |
| 1200 | 4475 |
| 1300 | 271 |
| 1500 | 69 |
| ... | ... |
| 9300 | 10 |
| 9320 | 3 |
| 9500 | 8 |
| 9520 | 971 |
| 9920 | 4 |

61 rows × 1 columns
```python
%%time
import duckdb

db = duckdb.connect()
db.execute('select count(*) from read_parquet("https://ewocstorage.blob.core.windows.net/collections/2021_PT_EUROCROP_POLY_110.parquet")').fetchall()
```
CPU times: user 255 ms, sys: 4.75 ms, total: 260 ms
Wall time: 294 ms
[(100000,)]
```python
%%time
db.query('select CT, count(*) from read_parquet("https://ewocstorage.blob.core.windows.net/collections/2021_PT_EUROCROP_POLY_110.parquet") GROUP BY CT').to_df()
```
CPU times: user 48.1 ms, sys: 17.3 ms, total: 65.3 ms
Wall time: 597 ms
| | CT | count_star() |
|---|---|---|
| 0 | 9100 | 21039 |
| 1 | 2000 | 1976 |
| 2 | 3530 | 153 |
| 3 | 7100 | 188 |
| 4 | 4380 | 11 |
| ... | ... | ... |
| 56 | 1200 | 4475 |
| 57 | 9520 | 971 |
| 58 | 7300 | 16 |
| 59 | 2190 | 1 |
| 60 | 3490 | 4 |

61 rows × 2 columns
```python
%%time
db.query('select ST_Centroid(ST_GeomFromWKB(geometry)) from read_parquet("https://ewocstorage.blob.core.windows.net/collections/2021_PT_EUROCROP_POLY_110.parquet") USING SAMPLE 100 ROWS').to_df()
```
CPU times: user 4.88 s, sys: 92.6 ms, total: 4.97 s
Wall time: 11.8 s
| | st_centroid(st_geomfromwkb(geometry)) |
|---|---|
| 0 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 1 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 2 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 3 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 4 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| ... | ... |
| 95 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 96 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 97 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 98 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 99 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |

100 rows × 1 columns
As shown above, both DuckDB and GeoPandas can efficiently handle Parquet files with 100k items served over HTTPS. With Parquet as the interface, data scientists can write complex queries in a language they know (pandas, SQL, …).
When looking at OGC API Features, there seem to be hardly any client libraries available. Basic support in GDAL seems to be the best option to connect to it. To benefit from server-side processing power, the most convenient option seems to be writing CQL filters. OGC API Features does not support aggregation, so a 'group by' operation is not possible.
Note: GeoParquet is on track to be adopted as an OGC standard, and hence satisfies standardization requirements.