```python
import pandas as pd

pd.read_csv("sampling_table.csv")
```
| | sample_id | stratification_flag | stratification_timestamp | extraction_id |
|---|---|---|---|---|
| 0 | my_sample | 1 | 2024-03-01T02:00:00Z | the_id |
While the RDM is built on top of a PostGIS database, we are increasingly using H3 for fast geospatial lookups. H3 is a discrete global grid system with properties that suit cases like ours well.
One example use case is creating a heatmap showing the distribution of samples for specific crop types across the world.
H3 indexes can also be computed on the fly from the geometry, but many operations are faster when performed on 64-bit integers rather than on geometries.
This can be achieved in different technology stacks. For instance, in [Postgres](https://github.com/zachasme/h3-pg):

```sql
SELECT h3_lat_lng_to_cell(POINT('37.3615593,-122.0553238'), 5);
```

or in [Python](https://uber.github.io/h3-py/intro.html#usage):

```python
h3.latlng_to_cell(lat, lng, resolution)
```
Other bindings are listed at https://h3geo.org/docs/community/bindings.
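The 64-bit structure can also be exploited directly: H3 stores the cell resolution in bits 52–55 of the index, so once a hexadecimal index string is converted to an integer, the resolution can be read without any library call. A stdlib-only sketch, using a resolution-5 example cell from the H3 documentation:

```python
# Example resolution-5 H3 cell taken from the H3 documentation.
cell = "85283473fffffff"

# H3 indexes are 64-bit values; converting the hex string to an integer
# makes joins and sorts cheaper than operating on strings or geometries.
cell_int = int(cell, 16)

# The resolution is stored in bits 52-55 of the index.
resolution = (cell_int >> 52) & 0xF
print(cell_int, resolution)  # resolution is 5 for this cell
```

This is only an illustration of the index layout; in practice the H3 bindings provide dedicated conversion and inspection functions.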
The total number of samples in the RDM is very large, and the samples are unevenly spread both geographically and in terms of crop type distribution. Hence the stratification strategy is an important factor that affects not only the quality of the model, but also the efficiency with which we can train models.
Therefore, the RDM needs to be able to run a given stratification approach to select a subset of the samples. It should also be possible to consider new data in the stratification.
We propose the following database design:
For the stratification algorithm itself, we would like to use a discrete global grid, specifically H3. In certain cases, this grid can also replace geospatial queries. Hence, for each sample, we would like to add the H3 index at level 5.
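To make the idea concrete, the sketch below caps the number of samples kept per (H3 level-5 cell, crop type) stratum in pandas. The column names, the capping rule, and the shuffle-then-head approach are assumptions for illustration, not the final RDM design:

```python
import pandas as pd

def stratify(df: pd.DataFrame, max_per_stratum: int = 2, seed: int = 42) -> pd.DataFrame:
    # Shuffle once, then keep at most `max_per_stratum` rows per
    # (H3 level-5 cell, crop type) stratum.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(["h3_l5", "crop_type"], sort=False).head(max_per_stratum)

# Toy samples table; cell values are illustrative strings only.
samples = pd.DataFrame({
    "sample_id": range(6),
    "h3_l5": ["85283473fffffff"] * 4 + ["85283447fffffff"] * 2,
    "crop_type": [1200, 1200, 1200, 1700, 1200, 1200],
})
selected = stratify(samples)
print(len(selected))  # 2 + 1 + 2 = 5 samples survive the cap
```

A real implementation would likely weight strata by target class balance rather than apply a flat cap, but the grouping key (H3 cell × crop type) stays the same.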
Selected samples are exported to GeoParquet for easy access via a public URL.
An example table is shown below.
Based on this table, the following queries can be defined. These are cases that we would need in practice.
These calls can be done through an OGC API Features request or via GeoParquet. The investigation below explores which approach works best.
For private reference data, we assume that we do not want to do stratification, but rather use all samples for extra training. This means that if the data is made available via a signed (secret) URL, the processing module can retrieve the locations and extract points.
For the generation of statistics, such as counts per crop type in H3 grid cells, the OGC Features API does not seem to include support. Hence we would need a background task that updates these statistics, either on a fixed time schedule or triggered by new ingestions into the RDM.
If this background task could directly generate a GeoParquet file, it may also be possible to avoid a more advanced setup (based on a database plus web service).
```python
%%time
import fsspec
import geopandas as gpd

pq_path = "https://ewocstorage.blob.core.windows.net/collections/2021_PT_EUROCROP_POLY_110.parquet"
with fsspec.open(pq_path) as file:
    df = gpd.read_parquet(file, columns=["geometry", "CT"])
df
```
CPU times: user 909 ms, sys: 305 ms, total: 1.21 s
Wall time: 17.3 s
| | geometry | CT |
|---|---|---|
| 0 | MULTIPOLYGON (((-8.54796 40.56554, -8.54942 40... | 1200 |
| 1 | MULTIPOLYGON (((-8.52352 40.55686, -8.52352 40... | 1700 |
| 2 | MULTIPOLYGON (((-8.52456 40.55538, -8.52454 40... | 3300 |
| 3 | MULTIPOLYGON (((-8.52835 40.56835, -8.52837 40... | 2000 |
| 4 | MULTIPOLYGON (((-8.52781 40.57128, -8.52814 40... | 0 |
| ... | ... | ... |
| 99995 | MULTIPOLYGON (((-6.46866 41.43669, -6.46832 41... | 0 |
| 99996 | MULTIPOLYGON (((-6.46797 41.43497, -6.46778 41... | 0 |
| 99997 | MULTIPOLYGON (((-7.45134 41.73074, -7.45134 41... | 1200 |
| 99998 | MULTIPOLYGON (((-6.47482 41.44217, -6.47480 41... | 0 |
| 99999 | MULTIPOLYGON (((-6.47503 41.44182, -6.47504 41... | 9520 |

100000 rows × 2 columns
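The per-crop-type counts shown next were presumably obtained with a simple groupby on the loaded frame, i.e. something along the lines of `df.groupby("CT").count()`. A toy reproduction of the pattern:

```python
import pandas as pd

# Toy frame mimicking the (geometry, CT) columns loaded above.
toy = pd.DataFrame({"geometry": ["g1", "g2", "g3"], "CT": [0, 0, 1200]})

# Counting rows per crop type is a single groupby.
counts = toy.groupby("CT").count()
print(counts)
```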
| CT | geometry |
|---|---|
| 0 | 37673 |
| 1100 | 191 |
| 1200 | 4475 |
| 1300 | 271 |
| 1500 | 69 |
| ... | ... |
| 9300 | 10 |
| 9320 | 3 |
| 9500 | 8 |
| 9520 | 971 |
| 9920 | 4 |

61 rows × 1 columns
```python
%%time
import duckdb

db = duckdb.connect()
db.execute('select count(*) from read_parquet("https://ewocstorage.blob.core.windows.net/collections/2021_PT_EUROCROP_POLY_110.parquet")').fetchall()
```
CPU times: user 255 ms, sys: 4.75 ms, total: 260 ms
Wall time: 294 ms
[(100000,)]
```python
%%time
db.query('select CT, count(*) from read_parquet("https://ewocstorage.blob.core.windows.net/collections/2021_PT_EUROCROP_POLY_110.parquet") GROUP BY CT').to_df()
```
CPU times: user 48.1 ms, sys: 17.3 ms, total: 65.3 ms
Wall time: 597 ms
| | CT | count_star() |
|---|---|---|
| 0 | 9100 | 21039 |
| 1 | 2000 | 1976 |
| 2 | 3530 | 153 |
| 3 | 7100 | 188 |
| 4 | 4380 | 11 |
| ... | ... | ... |
| 56 | 1200 | 4475 |
| 57 | 9520 | 971 |
| 58 | 7300 | 16 |
| 59 | 2190 | 1 |
| 60 | 3490 | 4 |

61 rows × 2 columns
```python
%%time
db.query('select ST_Centroid(ST_GeomFromWKB(geometry)) from read_parquet("https://ewocstorage.blob.core.windows.net/collections/2021_PT_EUROCROP_POLY_110.parquet") USING SAMPLE 100 ROWS').to_df()
```
CPU times: user 4.88 s, sys: 92.6 ms, total: 4.97 s
Wall time: 11.8 s
| | st_centroid(st_geomfromwkb(geometry)) |
|---|---|
| 0 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 1 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 2 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 3 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 4 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| ... | ... |
| 95 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 96 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 97 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 98 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |
| 99 | [0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,... |

100 rows × 1 columns
As shown above, both DuckDB and GeoPandas can efficiently handle Parquet files with 100k items served over HTTPS. With Parquet as the interface, data scientists can write complex queries in a language they know (pandas, SQL, …).
When looking at OGC API Features, there seem to be hardly any client libraries available. Basic support in GDAL seems to be the best option to connect to it. To benefit from server-side processing power, the most convenient option seems to be writing CQL filters. OGC API Features does not support aggregation, so a 'group by' operation is not possible.
Note: GeoParquet is on track to be adopted as an OGC standard, and hence satisfies standardization requirements.