How To: Land-Use-Land-Cover Prediction for Slovenia

This notebook shows the steps towards constructing a machine learning pipeline for predicting land use and land cover for the Republic of Slovenia. We will use satellite images obtained by ESA's Sentinel-2 to train a model and use it for prediction. The example leads you through the whole process of creating the pipeline, with details provided at each step.

Before you start


In order to run the example you'll need a Sentinel Hub account. If you do not have one yet, you can create a free trial account on the Sentinel Hub webpage. If you are a researcher, you can even apply for a free non-commercial account via the ESA OSEO page.

Once you have the account set up, log in to the Sentinel Hub Configurator. By default you will already have a default configuration with an instance ID (an alpha-numeric code of length 36). For this tutorial we recommend that you create a new configuration ("Add new configuration") and base it on the Python scripts template. Such a configuration will already contain all the layers used in these examples. Otherwise you will have to define the layers for your configuration yourself.

After you have prepared a configuration, please put the configuration's instance ID into the sentinelhub package's configuration file, following the configuration instructions. For Processing API requests you also need to obtain and set your OAuth client ID and secret. More info here.
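For reference, a minimal sketch of what the sentinelhub configuration file ends up holding (field names as in sentinelhub-py's config.json; the values here are placeholders, not real credentials):

```json
{
  "instance_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "sh_client_id": "<your OAuth client id>",
  "sh_client_secret": "<your OAuth client secret>"
}
```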


Part 1:

  1. Define the Area-of-Interest (AOI):

  • Obtain the outline of Slovenia (provided)

  • Split it into manageable smaller tiles

  • Select a small 5x5 area for classification

  2. Use the integrated sentinelhub-py package to fill the EOPatches with some content (band data, cloud masks, …)

  • Define the time interval (this example uses the whole year of 2019)

  3. Add additional information from band combinations (norm. vegetation index - NDVI, norm. water index - NDWI)

  4. Add a reference map (provided)

  • Convert the provided vector data to raster and add it to EOPatches

Part 2:

  1. Prepare the training data

  • Remove too cloudy scenes

  • Perform temporal interpolation (filling gaps and resampling to the same dates)

  • Apply erosion

  • Random spatial sampling of the EOPatches

  • Split patches for training/validation

  2. Construct and train the ML model

  • Make the prediction for each patch

  3. Validate the model

  4. Visualise the results

Let’s start!

# Firstly, some necessary imports

# Jupyter notebook related
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Built-in modules
import pickle
import sys
import os
import datetime
import itertools
from aenum import MultiValueEnum

# Basics of Python data handling and visualization
import numpy as np
import geopandas as gpd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.colors import ListedColormap, BoundaryNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
from shapely.geometry import Polygon
from tqdm.auto import tqdm

# Machine learning
import lightgbm as lgb
import joblib
from sklearn import metrics
from sklearn import preprocessing

# Imports from eo-learn and sentinelhub-py
from eolearn.core import EOTask, EOPatch, LinearWorkflow, FeatureType, OverwritePermission, \
    LoadTask, SaveTask, EOExecutor, ExtractBandsTask, MergeFeatureTask
from import SentinelHubInputTask, ExportToTiff
from eolearn.mask import AddMultiCloudMaskTask, AddValidDataMaskTask
from eolearn.geometry import VectorToRaster, PointSamplingTask, ErosionTask
from eolearn.features import LinearInterpolation, SimpleFilterTask, NormalizedDifferenceIndexTask
from sentinelhub import UtmZoneSplitter, BBox, CRS, DataSource

Part 1

1. Define the Area-of-Interest (AOI):

  • A geographical shape of Slovenia was taken from the Natural Earth database and a buffer of 500 m was applied. The shape is available in the repository: example_data/svn_border.geojson

  • Convert it to the selected CRS, taken to be the CRS of the central UTM tile (UTM_33N)

  • Split it into smaller, manageable, non-overlapping rectangular tiles

  • Run classification on a selected 5x5 area

Be sure that your choice of CRS is the same as the CRS of your reference data.

In the case that you are having problems with empty data being downloaded, try changing the CRS to something that suits the location of the AOI better.

Get country boundary

# Folder where data for running the notebook is stored
DATA_FOLDER = os.path.join('..', '..', 'example_data')

# Load geojson file
country = gpd.read_file(os.path.join(DATA_FOLDER, 'svn_border.geojson'))
country = country.buffer(500)

# Get the country's shape in polygon format
country_shape = country.geometry.values[-1]

# Plot country
country.plot();

# Print size
print('Dimension of the area is {0:.0f} x {1:.0f} m2'.format(country_shape.bounds[2] - country_shape.bounds[0],
                                                             country_shape.bounds[3] - country_shape.bounds[1]))
Dimension of the area is 243184 x 161584 m2

Split to smaller tiles and choose a 5x5 area

The splitting choice depends on the available resources of your computer. An EOPatch covering 500 x 500 pixels at 10 m resolution takes up about 1 GB.
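As a back-of-the-envelope check of that estimate (toy arithmetic, assuming float32 band values and on the order of 48 time frames, as in this example):

```python
# Rough memory footprint of one EOPatch's band cube:
# frames x height x width x bands x bytes-per-float32
n_frames, height, width, n_bands = 48, 500, 500, 6
bytes_per_value = 4  # float32

bands_bytes = n_frames * height * width * n_bands * bytes_per_value
print(f'BANDS array alone: {bands_bytes / 2**20:.0f} MiB')
```

The BANDS cube alone is roughly 275 MiB; masks, cloud probabilities and derived indices bring the total towards the quoted ~1 GB.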

# Create the splitter to obtain a list of bboxes
bbox_splitter = UtmZoneSplitter([country_shape],, 5000)

bbox_list = np.array(bbox_splitter.get_bbox_list())
info_list = np.array(bbox_splitter.get_info_list())

# Prepare info of selected EOPatches
geometry = [Polygon(bbox.get_polygon()) for bbox in bbox_list]
idxs = [info['index'] for info in info_list]
idxs_x = [info['index_x'] for info in info_list]
idxs_y = [info['index_y'] for info in info_list]

gdf = gpd.GeoDataFrame({'index': idxs, 'index_x': idxs_x, 'index_y': idxs_y},
                       crs=country.crs, geometry=geometry)

# select a 5x5 area (id of center patch)
ID = 616

# Obtain surrounding 5x5 patches
patchIDs = []
for idx, [bbox, info] in enumerate(zip(bbox_list, info_list)):
    if (abs(info['index_x'] - info_list[ID]['index_x']) <= 2 and
        abs(info['index_y'] - info_list[ID]['index_y']) <= 2):
        patchIDs.append(idx)

# Check if final size is 5x5
if len(patchIDs) != 5*5:
    print('Warning! Use a different central patch ID, this one is on the border.')

# Change the order of the patches (used for plotting later)
patchIDs = np.transpose(np.fliplr(np.array(patchIDs).reshape(5, 5))).ravel()

# save to shapefile
shapefile_name = './grid_slovenia_500x500.gpkg'
gdf.to_file(shapefile_name, driver='GPKG')
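The fliplr + transpose reordering applied to patchIDs above can be illustrated on a toy 5x5 grid of IDs:

```python
import numpy as np

# 25 IDs arrive sorted by grid index; flipping columns and transposing
# rearranges them into the row-major order used by the 5x5 plotting grid.
ids = np.arange(25)
reordered = np.transpose(np.fliplr(ids.reshape(5, 5))).ravel()
print(reordered[:5])  # first plot row: [ 4  9 14 19 24]
```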

Visualize the selection

# figure
fig, ax = plt.subplots(figsize=(30, 30))
country.plot(ax=ax, facecolor='w', edgecolor='b', alpha=0.5)
ax.set_title('Selected 5x5 tiles from Slovenia', fontsize=25);
for bbox, info in zip(bbox_list, info_list):
    geo = bbox.geometry
    ax.text(geo.centroid.x, geo.centroid.y, info['index'], ha='center', va='center')



2. - 4. Fill EOPatches with data:

Now it’s time to create EOPatches and fill them with Sentinel-2 data using Sentinel Hub services. We will add the following data to each EOPatch:

  • L1C custom list of bands [B02, B03, B04, B08, B11, B12], which corresponds to [B, G, R, NIR, SWIR1, SWIR2] wavelengths.

  • SentinelHub’s cloud mask

Additionally, we will add:

  • Calculated NDVI, NDWI, and NDBI information

  • A mask of validity, based on acquired data from Sentinel and cloud coverage. A pixel is valid if:

    1. IS_DATA == True

    2. CLOUD_MASK == 0 (1 indicates cloudy pixels and 255 indicates NO_DATA)
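A toy numpy illustration of this validity rule (not the actual EOTask, just the per-pixel logic):

```python
import numpy as np

# A pixel is valid when data is present AND the cloud mask is clear (0).
# In CLM, 1 marks cloudy pixels and 255 marks NO_DATA.
is_data = np.array([1, 1, 1, 0], dtype=np.uint8)
clm = np.array([0, 1, 255, 0], dtype=np.uint8)

is_valid = np.logical_and(is_data.astype(bool), clm == 0)
print(is_valid)  # [ True False False False]
```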

An EOPatch is created and manipulated using EOTasks, which are chained in an EOWorkflow. In this example the final workflow is executed on all patches, which are saved to the specified directory.

Define some needed custom EOTasks

class SentinelHubValidData:
    """ Combine the `IS_DATA` mask with Sentinel Hub's cloud mask to define a `VALID_DATA_SH` mask.
    The SentinelHub's cloud mask is assumed to be found in eopatch.mask['CLM']
    """
    def __call__(self, eopatch):
        return np.logical_and(eopatch.mask['IS_DATA'].astype(np.bool),
                              np.logical_not(eopatch.mask['CLM'].astype(np.bool)))

class CountValid(EOTask):
    """ The task counts the number of valid observations in the time-series and stores the result in a timeless mask. """
    def __init__(self, count_what, feature_name):
        self.what = count_what = feature_name

    def execute(self, eopatch):
        eopatch.add_feature(FeatureType.MASK_TIMELESS,,
                            np.count_nonzero(eopatch.mask[self.what], axis=0))

        return eopatch

Define the workflow tasks

# add a request for S2 bands
# Here we also do a simple filter of cloudy scenes (on tile level)
# s2cloudless masks and probabilities are requested via additional data
band_names = ['B02', 'B03', 'B04', 'B08', 'B11', 'B12']
add_data = SentinelHubInputTask(
    bands_feature=(FeatureType.DATA, 'BANDS'),
    bands=band_names,
    resolution=10,
    maxcc=0.8,
    time_difference=datetime.timedelta(minutes=120),
    data_source=DataSource.SENTINEL2_L1C,
    additional_data=[(FeatureType.MASK, 'dataMask', 'IS_DATA'),
                     (FeatureType.MASK, 'CLM'),
                     (FeatureType.DATA, 'CLP')]
)

# NDVI: (B08 - B04)/(B08 + B04)
# NDWI: (B03 - B08)/(B03 + B08)
# NDBI: (B11 - B08)/(B11 + B08)
ndvi = NormalizedDifferenceIndexTask((FeatureType.DATA, 'BANDS'), (FeatureType.DATA, 'NDVI'),
                                     [band_names.index('B08'), band_names.index('B04')])
ndwi = NormalizedDifferenceIndexTask((FeatureType.DATA, 'BANDS'), (FeatureType.DATA, 'NDWI'),
                                     [band_names.index('B03'), band_names.index('B08')])
ndbi = NormalizedDifferenceIndexTask((FeatureType.DATA, 'BANDS'), (FeatureType.DATA, 'NDBI'),
                                     [band_names.index('B11'), band_names.index('B08')])

# validate pixels using SentinelHub's cloud detection mask and region of acquisition
add_sh_valmask = AddValidDataMaskTask(SentinelHubValidData(),
                                      'IS_VALID')  # name of output mask

# count number of valid observations per pixel using valid data mask
count_val_sh = CountValid('IS_VALID',     # name of existing mask
                          'VALID_COUNT')  # name of output scalar

path_out = './eopatches/'
if not os.path.isdir(path_out):
    os.makedirs(path_out)
save = SaveTask(path_out, overwrite_permission=OverwritePermission.OVERWRITE_PATCH)
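What NormalizedDifferenceIndexTask computes can be sketched in plain numpy, here for NDVI with made-up reflectance values:

```python
import numpy as np

# NDVI = (B08 - B04) / (B08 + B04), computed per pixel.
b08 = np.array([0.6, 0.4])  # NIR reflectance (toy values)
b04 = np.array([0.2, 0.4])  # red reflectance (toy values)

ndvi = (b08 - b04) / (b08 + b04)
print(ndvi)  # [0.5 0. ]
```

Dense vegetation (strong NIR, weak red) gives values towards 1, while equal reflectances give 0.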

Help! I prefer to calculate cloud masks of my own!

If you wish to calculate s2cloudless masks and probabilities (almost) from scratch, you can do this by using the following two EOTasks instead of the first one above

band_names = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B10', 'B11', 'B12']
add_data = SentinelHubInputTask(
    bands_feature=(FeatureType.DATA, 'BANDS'),
    bands=band_names,
    resolution=10,
    maxcc=0.8,
    time_difference=datetime.timedelta(minutes=120),
    data_source=DataSource.SENTINEL2_L1C,
    additional_data=[(FeatureType.MASK, 'dataMask', 'IS_DATA')]
)

add_clm = AddMultiCloudMaskTask(data_feature='BANDS',
                                mono_features=('CLP', 'CLM'))

Reference map task

For this example, a subset of the country-wide reference for land-use-land-cover is provided. It is available in the form of a geopackage, which contains polygons and their corresponding labels. The labels represent the following 10 classes:

  • lulcid = 0, name = no data

  • lulcid = 1, name = cultivated land

  • lulcid = 2, name = forest

  • lulcid = 3, name = grassland

  • lulcid = 4, name = shrubland

  • lulcid = 5, name = water

  • lulcid = 6, name = wetlands

  • lulcid = 7, name = tundra

  • lulcid = 8, name = artificial surface

  • lulcid = 9, name = bareland

  • lulcid = 10, name = snow and ice

We have defined a land cover enum class for ease of use below.

class LULC(MultiValueEnum):
    """ Enum class containing basic LULC types """
    NO_DATA            = 'No Data',            0,  '#ffffff'
    CULTIVATED_LAND    = 'Cultivated Land',    1,  '#ffff00'
    FOREST             = 'Forest',             2,  '#054907'
    GRASSLAND          = 'Grassland',          3,  '#ffa500'
    SHRUBLAND          = 'Shrubland',          4,  '#806000'
    WATER              = 'Water',              5,  '#069af3'
    WETLAND            = 'Wetlands',           6,  '#95d0fc'
    TUNDRA             = 'Tundra',             7,  '#967bb6'
    ARTIFICIAL_SURFACE = 'Artificial Surface', 8,  '#dc143c'
    BARELAND           = 'Bareland',           9,  '#a6a6a6'
    SNOW_AND_ICE       = 'Snow and Ice',       10, '#000000'

    @property
    def id(self):
        """ Returns an ID of an enum type

        :return: An ID
        :rtype: int
        """
        return self.values[1]

    @property
    def color(self):
        """ Returns class color

        :return: A color in hexadecimal representation
        :rtype: str
        """
        return self.values[2]

def get_bounds_from_ids(ids):
    bounds = []
    for i in range(len(ids)):
        if i < len(ids) - 1:
            if i == 0:
                diff = (ids[i + 1] - ids[i]) / 2
                bounds.append(ids[i] - diff)
            diff = (ids[i + 1] - ids[i]) / 2
            bounds.append(ids[i] + diff)
        else:
            diff = (ids[i] - ids[i - 1]) / 2
            bounds.append(ids[i] + diff)
    return bounds

# Reference colormap things
lulc_bounds = get_bounds_from_ids([ for x in LULC])
lulc_cmap = ListedColormap([x.color for x in LULC], name="lulc_cmap")
lulc_norm = BoundaryNorm(lulc_bounds, lulc_cmap.N)
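As a self-contained sanity check of the bounds helper (the function is repeated here so the snippet runs on its own), consecutive integer IDs yield half-integer class boundaries:

```python
def get_bounds_from_ids(ids):
    """ Boundaries halfway between neighbouring class IDs (same helper as above). """
    bounds = []
    for i in range(len(ids)):
        if i < len(ids) - 1:
            if i == 0:
                diff = (ids[i + 1] - ids[i]) / 2
                bounds.append(ids[i] - diff)
            diff = (ids[i + 1] - ids[i]) / 2
            bounds.append(ids[i] + diff)
        else:
            diff = (ids[i] - ids[i - 1]) / 2
            bounds.append(ids[i] + diff)
    return bounds

# For the 11 LULC IDs 0..10 this gives 12 boundaries, one around each class:
print(get_bounds_from_ids(list(range(11))))
# [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]
```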

The main point of this task is to create a raster mask from the vector polygons and add it to the eopatch. With this procedure, any kind of labeled shapefile can be transformed into a raster reference map. This result is achieved with the existing task VectorToRaster from the eolearn.geometry package. All polygons belonging to each of the classes are separately burned into the raster mask.

Land use data are public in Slovenia, you can use the provided partial dataset for this example, or download the full dataset (if you want to upscale the project) from our bucket. The datasets have already been pre-processed for the purposes of the example.

# takes some time due to the large size of the reference data
land_use_ref_path = os.path.join(DATA_FOLDER, 'land_use_10class_reference_slovenia_partial.gpkg')
land_use_ref = gpd.read_file(land_use_ref_path)

rasterization_task = VectorToRaster(land_use_ref, (FeatureType.MASK_TIMELESS, 'LULC'),
                                    values_column='lulcid', raster_shape=(FeatureType.MASK, 'IS_DATA'),
                                    raster_dtype=np.uint8)

Define the workflow

All the tasks that were defined so far create and fill the EOPatches. The tasks need to be put in some order and executed one by one. This can be achieved by manually executing the tasks, or more conveniently, defining an EOWorkflow which does this for you.

The following workflow is created and executed:

  1. Create EOPatches with band and cloud data

  2. Calculate and add NDVI, NDWI, NDBI

  3. Add mask of valid pixels

  4. Add scalar feature representing the count of valid pixels

  5. Save eopatches

An EOWorkflow can be linear or more complex, but it should be acyclic. Here we will use the linear case of the EOWorkflow, available as LinearWorkflow

Define the workflow

# Define the workflow
workflow = LinearWorkflow(
    add_data,
    ndvi,
    ndwi,
    ndbi,
    add_sh_valmask,
    count_val_sh,
    rasterization_task,
    save
)

# Let's visualize it

This may take some time, so go grab a cup of coffee …


# Execute the workflow
time_interval = ['2019-01-01', '2019-12-31'] # time interval for the SH request

# define additional parameters of the workflow
execution_args = []
for idx, bbox in enumerate(bbox_list[patchIDs]):
    execution_args.append({
        add_data: {'bbox': bbox, 'time_interval': time_interval},
        save: {'eopatch_folder': f'eopatch_{idx}'}
    })

executor = EOExecutor(workflow, execution_args, save_logs=True), multiprocess=True)

/home/ubuntu/eo-learn/features/eolearn/features/ RuntimeWarning: invalid value encountered in true_divide
  ndi = (band_a - band_b + self.acorvi_constant) / (band_a + band_b + self.acorvi_constant)

CPU times: user 4min 22s, sys: 1min 24s, total: 5min 47s
Wall time: 3min 36s

Visualize the patches

Let’s load a single EOPatch and look at its structure. By executing

eopatch = EOPatch.load('./eopatches/eopatch_0')

we obtain the following structure:

EOPatch(
  data: {
    BANDS: numpy.ndarray(shape=(48, 500, 500, 6), dtype=float32)
    CLP: numpy.ndarray(shape=(48, 500, 500, 1), dtype=uint8)
    NDBI: numpy.ndarray(shape=(48, 500, 500, 1), dtype=float32)
    NDVI: numpy.ndarray(shape=(48, 500, 500, 1), dtype=float32)
    NDWI: numpy.ndarray(shape=(48, 500, 500, 1), dtype=float32)
  }
  mask: {
    CLM: numpy.ndarray(shape=(48, 500, 500, 1), dtype=uint8)
    IS_DATA: numpy.ndarray(shape=(48, 500, 500, 1), dtype=uint8)
    IS_VALID: numpy.ndarray(shape=(48, 500, 500, 1), dtype=bool)
  }
  scalar: {}
  label: {}
  vector: {}
  data_timeless: {}
  mask_timeless: {
    LULC: numpy.ndarray(shape=(500, 500, 1), dtype=uint8)
    VALID_COUNT: numpy.ndarray(shape=(500, 500, 1), dtype=int64)
  }
  scalar_timeless: {}
  label_timeless: {}
  vector_timeless: {}
  meta_info: {
    maxcc: 0.8
    service_type: 'processing'
    size_x: 500
    size_y: 500
    time_difference: datetime.timedelta(seconds=7200)
    time_interval: ('2019-01-01T00:00:00', '2019-12-31T23:59:59')
  }
  bbox: BBox(((500000.0, 5135000.0), (505000.0, 5140000.0)), crs=CRS('32633'))
  timestamp: [datetime.datetime(2019, 1, 1, 10, 7, 42), ..., datetime.datetime(2019, 12, 7, 10, 7, 45)], length=48
)

It is possible to then access various EOPatch content via calls like:

eopatch.mask_timeless['LULC']['NDVI'][0]['BANDS'][5][..., [3, 2, 1]]

Due to the maxcc filtering, not all patches have the same number of timestamps.

Let’s select a date and draw the closest timestamp for each eopatch.
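The closest-timestamp selection works like this on a few toy dates:

```python
import datetime
import numpy as np

# Sort frames by absolute time distance to the target date, take the first.
target = datetime.datetime(2019, 7, 1)
dates = np.array([datetime.datetime(2019, 6, 10),
                  datetime.datetime(2019, 6, 28),
                  datetime.datetime(2019, 7, 15)])

closest_date_id = np.argsort(abs(target - dates))[0]
print(closest_date_id)  # 1 -> 2019-06-28, only 3 days from the target
```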

# Draw the RGB image
path_out = './eopatches'
fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(20, 20))

date = datetime.datetime(2019, 7, 1)

for i in tqdm(range(len(patchIDs))):
    eopatch = EOPatch.load(f'{path_out}/eopatch_{i}', lazy_loading=True)
    ax = axs[i//5][i%5]
    dates = np.array(eopatch.timestamp)
    closest_date_id = np.argsort(abs(date-dates))[0]
    ax.imshow(np.clip(['BANDS'][closest_date_id][..., [2, 1, 0]] * 3.5, 0, 1))
    del eopatch

fig.subplots_adjust(wspace=0, hspace=0)


Visualize the reference map

path_out = './eopatches'

fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(20, 25))

for i in tqdm(range(len(patchIDs))):
    eopatch = EOPatch.load(f'{path_out}/eopatch_{i}', lazy_loading=True)
    ax = axs[i//5][i%5]
    im = ax.imshow(eopatch.mask_timeless['LULC'].squeeze(), cmap=lulc_cmap, norm=lulc_norm)
    del eopatch

fig.subplots_adjust(wspace=0, hspace=0)

cb = fig.colorbar(im, ax=axs.ravel().tolist(), orientation='horizontal', pad=0.01, aspect=100)
cb.set_ticks([ for entry in LULC])[ for entry in LULC], rotation=45, fontsize=15)


Plot the map of valid pixel counts

path_out = './eopatches'

vmin, vmax = None, None
for i in range(len(patchIDs)):
    eopatch = EOPatch.load(f'{path_out}/eopatch_{i}', lazy_loading=True)
    data = eopatch.mask_timeless['VALID_COUNT'].squeeze()
    vmin = np.min(data) if vmin is None else (np.min(data) if np.min(data) < vmin else vmin)
    vmax = np.max(data) if vmax is None else (np.max(data) if np.max(data) > vmax else vmax)

fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(20, 25))

for i in tqdm(range(len(patchIDs))):
    eopatch = EOPatch.load(f'{path_out}/eopatch_{i}', lazy_loading=True)
    ax = axs[i//5][i%5]
    im = ax.imshow(eopatch.mask_timeless['VALID_COUNT'].squeeze(), vmin=vmin, vmax=vmax)
    del eopatch

fig.subplots_adjust(wspace=0, hspace=0)

cb = fig.colorbar(im, ax=axs.ravel().tolist(), orientation='horizontal', pad=0.01, aspect=100)


Spatial mean of NDVI

Plot the mean of NDVI over all pixels in a selected patch throughout the year. Filter out clouds in the mean calculation.

path_out = './eopatches'

eID = 16
eopatch = EOPatch.load(f'{path_out}/eopatch_{eID}', lazy_loading=True)

ndvi =['NDVI'] # ndvi data cube
mask = eopatch.mask['IS_VALID'] # mask of valid pixels
time = np.array(eopatch.timestamp) # x axis
t, w, h, _ = ndvi.shape

ndvi_clean = ndvi.copy()
ndvi_clean[~mask] = np.nan # set values of invalid pixels to NaN's

# Calculate means, remove NaN's from means
ndvi_mean = np.nanmean(ndvi.reshape(t, w * h).squeeze(), axis=1)
ndvi_mean_clean = np.nanmean(ndvi_clean.reshape(t, w * h).squeeze(), axis=1)
time_clean = time[~np.isnan(ndvi_mean_clean)]
ndvi_mean_clean = ndvi_mean_clean[~np.isnan(ndvi_mean_clean)]

fig = plt.figure(figsize=(20,5))
plt.plot(time_clean, ndvi_mean_clean, 's-', label = 'Mean NDVI with cloud cleaning')
plt.plot(time, ndvi_mean, 'o-', label='Mean NDVI without cloud cleaning')
plt.xlabel('Time', fontsize=15)
plt.ylabel('Mean NDVI over patch', fontsize=15)

plt.legend(loc=2, prop={'size': 15});
/home/ubuntu/.pyenv/versions/3.7.6/envs/base/lib/python3.7/site-packages/ RuntimeWarning: Mean of empty slice

Temporal mean of NDVI

Plot the time-wise mean of NDVI for the whole region. Filter out clouds in the mean calculation.

path_out = './eopatches'

fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(20, 25))

for i in tqdm(range(len(patchIDs))):
    eopatch = EOPatch.load(f'{path_out}/eopatch_{i}', lazy_loading=True)
    ax = axs[i//5][i%5]
    ndvi =['NDVI']
    mask = eopatch.mask['IS_VALID']
    ndvi[~mask] = np.nan
    ndvi_mean = np.nanmean(ndvi, axis=0).squeeze()
    im = ax.imshow(ndvi_mean, vmin=0, vmax=0.8, cmap=plt.get_cmap('YlGn'))
    del eopatch

fig.subplots_adjust(wspace=0, hspace=0)

cb = fig.colorbar(im, ax=axs.ravel().tolist(), orientation='horizontal', pad=0.01, aspect=100)


Plot the average cloud probability

Plot the average cloud probability for each pixel, taking the cloud mask into account.

Some structures can be seen like road networks etc., indicating a bias of the cloud detector towards these objects.

path_out = './eopatches'

fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(20, 25))

for i in tqdm(range(len(patchIDs))):
    eopatch = EOPatch.load(f'{path_out}/eopatch_{i}', lazy_loading=True)
    ax = axs[i//5][i%5]
    clp =['CLP'].astype(float)/255
    mask = eopatch.mask['IS_VALID']
    clp[~mask] = np.nan
    clp_mean = np.nanmean(clp, axis=0).squeeze()
    im = ax.imshow(clp_mean, vmin=0.0, vmax=0.3)
    del eopatch

fig.subplots_adjust(wspace=0, hspace=0)

cb = fig.colorbar(im, ax=axs.ravel().tolist(), orientation='horizontal', pad=0.01, aspect=100)


Part 2

5. Prepare the training data

We will create a new workflow that processes the data:

  1. Remove too cloudy scenes

  • Check the ratio of the valid data for each patch and for each time frame

  • Keep only time frames with > 80 % valid coverage (no clouds)

  2. Concatenate BAND, NDVI, NDWI, NDBI info into a single feature called FEATURES

  3. Perform temporal interpolation (filling gaps and resampling to the same dates)

  • Create a task for linear interpolation in the temporal dimension

  • Provide the cloud mask to tell the interpolating function which values to update

  4. Perform erosion

  • This removes artefacts with a width of 1 px, and also removes the edges between polygons of different classes

  5. Random spatial sampling of the EOPatches

  • Randomly take a subset of pixels from a patch to use in the machine learning training

  6. Split patches for training/validation

  • Split the patches into a training and validation set
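The erosion step can be illustrated with a toy numpy sketch. Note this is a hypothetical stand-in for eo-learn's ErosionTask, keeping a pixel's label only when its 4-neighbours carry the same label:

```python
import numpy as np

def erode_labels(lulc):
    """ Toy label erosion: a pixel keeps its label only if all 4-neighbours
    (with edge padding) agree; otherwise it is set to 0 (no data). This drops
    1-px artefacts and the boundaries between classes. """
    padded = np.pad(lulc, 1, mode='edge')
    same = np.ones_like(lulc, dtype=bool)
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        same &= padded[1 + dy:1 + dy + lulc.shape[0],
                       1 + dx:1 + dx + lulc.shape[1]] == lulc
    return np.where(same, lulc, 0)

# Forest (2) bordering grassland (3): the boundary pixels are removed.
lulc = np.array([[2, 2, 2],
                 [2, 2, 3],
                 [2, 3, 3]])
print(erode_labels(lulc))
```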

Define EOTasks

class ValidDataFractionPredicate:
    """ Predicate that defines if a frame from EOPatch's time-series is valid or not. A frame is valid if the
    valid data fraction is above the specified threshold.
    """
    def __init__(self, threshold):
        self.threshold = threshold

    def __call__(self, array):
        coverage = np.sum(array.astype(np.uint8)) /
        return coverage > self.threshold

load = LoadTask(path_out)

concatenate = MergeFeatureTask({FeatureType.DATA: ['BANDS', 'NDVI', 'NDWI', 'NDBI']},
                               (FeatureType.DATA, 'FEATURES'))

# keep frames with > 80 % valid coverage
valid_data_predicate = ValidDataFractionPredicate(0.8)
filter_task = SimpleFilterTask((FeatureType.MASK, 'IS_VALID'), valid_data_predicate)

# linear interpolation of full time-series and date resampling
resampled_range = ('2019-01-01', '2019-12-31', 15)
linear_interp = LinearInterpolation(
    'FEATURES', # name of feature to interpolate
    mask_feature=(FeatureType.MASK, 'IS_VALID'), # mask to be used in interpolation
    copy_features=[(FeatureType.MASK_TIMELESS, 'LULC')], # features to keep
    resample_range=resampled_range, # set the resampling range
    bounds_error=False # extrapolate with NaN's
)

# erode each class of the reference map
erosion = ErosionTask(mask_feature=(FeatureType.MASK_TIMELESS,'LULC','LULC_ERODED'), disk_radius=1)

# Uniformly sample pixels from patches
n_samples = 125000 # half of the pixels in a patch
ref_labels = list(range(11)) # reference labels to take into account when sampling
spatial_sampling = PointSamplingTask(
    n_samples=n_samples,
    ref_mask_feature='LULC_ERODED',
    ref_labels=ref_labels,
    sample_features=[  # tag fields to sample
        (FeatureType.DATA, 'FEATURES'),
        (FeatureType.MASK_TIMELESS, 'LULC_ERODED')
    ]
)

path_out_sampled = './eopatches_sampled'
if not os.path.isdir(path_out_sampled):
    os.makedirs(path_out_sampled)
save = SaveTask(path_out_sampled, overwrite_permission=OverwritePermission.OVERWRITE_PATCH)

# Define the workflow
workflow = LinearWorkflow(
    load,
    concatenate,
    filter_task,
    linear_interp,
    erosion,
    spatial_sampling,
    save
)

Run the EOWorkflow over all EOPatches


execution_args = []
for idx in range(len(patchIDs)):
    execution_args.append({
        load: {'eopatch_folder': f'eopatch_{idx}'},
        spatial_sampling: {'seed': 42},
        save: {'eopatch_folder': f'eopatch_{idx}'}
    })

executor = EOExecutor(workflow, execution_args, save_logs=True), multiprocess=True)


CPU times: user 56 ms, sys: 876 ms, total: 932 ms
Wall time: 3min 3s

6. Model construction and training

The patches are split into train and test subsets.

The test sample is hand-picked because of the small set of patches; with a larger overall set, the training and testing patches should be chosen randomly.

The sampled features and labels are loaded and reshaped into \(n \times m\), where \(n\) represents the number of training pixels, and \(m = f \times t\) the number of all features (in this example 225), with \(f\) the number of bands and band combinations (in this example 9) and \(t\) the length of the resampled time-series (in this example 25).
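The \((p, t, w, h, f) \rightarrow (p \cdot w \cdot h, t \cdot f)\) reshaping can be checked on a toy array:

```python
import numpy as np

# Move the time axis next to the feature axis, then flatten:
# (p, t, w, h, f) -> (p, w, h, t, f) -> (p*w*h, t*f)
p, t, w, h, f = 2, 3, 2, 2, 4
features = np.random.default_rng(0).random((p, t, w, h, f))

flat = np.moveaxis(features, 1, 3).reshape(p * w * h, t * f)
print(flat.shape)  # (8, 12)

# each row is one pixel; its first f entries are that pixel's features at t=0
assert np.allclose(flat[0, :f], features[0, 0, 0, 0])
```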

LightGBM is used as a ML model. It is a fast, distributed, high performance gradient boosting framework based on decision tree algorithms, used for many machine learning tasks.

The default hyper-parameters are used in this example. For more info on parameter tuning, check the ReadTheDocs of the package.

# load sampled eopatches
eopatches = []
path_out_sampled = './eopatches_sampled'

for idx in range(len(patchIDs)):
    eopatches.append(EOPatch.load(f'{path_out_sampled}/eopatch_{idx}', lazy_loading=True))

eopatches = np.array(eopatches)
# Definition of the train and test patch IDs, take 80 % for train
test_ID = [0, 8, 16, 19, 20]
train_ID = np.argwhere(~np.in1d(patchIDs, patchIDs[test_ID])).squeeze(axis=-1)

# Set the features and the labels for train and test sets
features_train = np.array([['FEATURES_SAMPLED'] for eopatch in eopatches[train_ID]])
labels_train = np.array([eopatch.mask_timeless['LULC_ERODED_SAMPLED'] for eopatch in eopatches[train_ID]])
features_test = np.array([['FEATURES_SAMPLED'] for eopatch in eopatches[test_ID]])
labels_test = np.array([eopatch.mask_timeless['LULC_ERODED_SAMPLED'] for eopatch in eopatches[test_ID]])

# get shape
p1, t, w, h, f = features_train.shape
p2, t, w, h, f = features_test.shape
p = p1 + p2

# reshape to n x m
features_train = np.moveaxis(features_train, 1, 3).reshape(p1 * w * h, t * f)
labels_train = np.moveaxis(labels_train, 1, 2).reshape(p1 * w * h, 1).squeeze()
features_test = np.moveaxis(features_test, 1, 3).reshape(p2 * w * h, t * f)
labels_test = np.moveaxis(labels_test, 1, 2).reshape(p2 * w * h, 1).squeeze()

# remove points with no reference from training (so we dont train to recognize "no data")
mask_train = labels_train == 0
features_train = features_train[~mask_train]
labels_train = labels_train[~mask_train]

# remove points with no reference from test (so we dont validate on "no data", which doesn't make sense)
mask_test = labels_test == 0
features_test = features_test[~mask_test]
labels_test = labels_test[~mask_test]
(1466764, 225)

Set up and train the model


# Set up training classes
labels_unique = np.unique(labels_train)

# Set up the model
model = lgb.LGBMClassifier(
    objective='multiclass',
    num_class=len(labels_unique),
    metric='multi_logloss'
)

# train the model, labels_train)

# save the model
joblib.dump(model, './model_SI_LULC.pkl')
CPU times: user 26min 57s, sys: 225 ms, total: 26min 58s
Wall time: 1min 43s

7. Validation

Validation of the model is a crucial step in data science. All models are wrong, but some are less wrong than others, so model evaluation is important.

In order to validate the model, we use it to predict the classes on the test set, and then compare the predicted labels to the “ground truth”.

Unfortunately, “ground truth” in the scope of EO is a term that should be taken lightly. Usually it is not 100 % reliable, for several reasons:

  • Labels are determined at a specific time, but land use can change (what was once a field may now be a house)

  • Labels are overly generalized (a city is an artificial surface, but it also contains parks, forests etc.)

  • Some classes can have an overlap or similar definitions (part of a continuum, and not discrete distributions)

  • Human error (mistakes made when producing the reference map)

The validation is performed by evaluating various metrics, such as accuracy, precision, recall, \(F_1\) score, some of which are nicely described in this blog post.
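On a toy two-class example, the per-class metrics reduce to simple counts (hand-rolled here rather than via sklearn, to make the definitions explicit):

```python
import numpy as np

# Per-class precision, recall and F1 on a toy 2-class problem.
truth = np.array([1, 1, 1, 2, 2, 2])
pred = np.array([1, 1, 2, 2, 2, 1])

results = {}
for cls in (1, 2):
    tp = np.sum((pred == cls) & (truth == cls))
    precision = tp / np.sum(pred == cls)  # of pixels predicted cls, fraction truly cls
    recall = tp / np.sum(truth == cls)    # of truly cls pixels, fraction found
    f1 = 2 * precision * recall / (precision + recall)
    results[cls] = (precision, recall, f1)
    print(f'class {cls}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```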

# load the model
model_path = './model_SI_LULC.pkl'
model = joblib.load(model_path)

# predict the test labels
plabels_test = model.predict(features_test)

Get the overall accuracy (OA) and the weighted \(F_1\) score, as well as the \(F_1\) score, precision, and recall for each class separately

class_labels = np.unique(labels_test)
class_names = [ for entry in LULC]
mask = np.in1d(plabels_test, labels_test)
pred = plabels_test[mask]
lbls = labels_test[mask]

f1_scores = metrics.f1_score(lbls, pred, labels=class_labels, average=None)
recall = metrics.recall_score(lbls, pred, labels=class_labels, average=None)
precision = metrics.precision_score(lbls, pred, labels=class_labels, average=None)

print('Classification accuracy {:.1f}%'.format(100 * metrics.accuracy_score(lbls, pred)))
print('Classification F1-score {:.1f}%'.format(100 * metrics.f1_score(lbls, pred, average='weighted')))
print('             Class              =  F1  | Recall | Precision')
print('         --------------------------------------------------')
for idx, lulctype in enumerate([class_names[idx] for idx in class_labels]):
    print('         * {0:20s} = {1:2.1f} |  {2:2.1f}  | {3:2.1f}'.format(lulctype,
                                                                         f1_scores[idx] * 100,
                                                                         recall[idx] * 100,
                                                                         precision[idx] * 100))
Classification accuracy 93.7%
Classification F1-score 93.6%

             Class              =  F1  | Recall | Precision
         * CULTIVATED_LAND      = 88.1 |  86.1  | 90.2
         * FOREST               = 98.6 |  98.8  | 98.4
         * GRASSLAND            = 87.8 |  89.0  | 86.6
         * SHRUBLAND            = 35.3 |  32.4  | 38.8
         * WATER                = 95.1 |  98.0  | 92.3
         * ARTIFICIAL_SURFACE   = 95.6 |  96.0  | 95.2

Plot the standard and transposed Confusion Matrix

# Define the plotting function
def plot_confusion_matrix(cm, classes,
                          normalize=True,
                          title='Confusion matrix',
                          ylabel='True label', xlabel='Predicted label', filename=None):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    np.set_printoptions(precision=2, suppress=True)

    if normalize:
        cm = cm.astype('float') / (cm.sum(axis=1)[:, np.newaxis] + np.finfo(float).eps)

    plt.imshow(cm, interpolation='nearest', cmap=cmap, vmin=0, vmax=1)
    plt.title(title, fontsize=20)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90, fontsize=20)
    plt.yticks(tick_marks, classes, fontsize=20)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment='center',
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel(ylabel, fontsize=20)
    plt.xlabel(xlabel, fontsize=20)
fig = plt.figure(figsize=(20, 20))

plt.subplot(1, 2, 1)
conf_matrix_gbm = metrics.confusion_matrix(lbls, pred)
plot_confusion_matrix(conf_matrix_gbm,
                      classes=[name for idx, name in enumerate(class_names) if idx in class_labels],
                      ylabel='Truth (LAND COVER)',
                      xlabel='Predicted (GBM)',
                      title='Confusion matrix');

plt.subplot(1, 2, 2)
conf_matrix_gbm = metrics.confusion_matrix(pred, lbls)
plot_confusion_matrix(conf_matrix_gbm,
                      classes=[name for idx, name in enumerate(class_names) if idx in class_labels],
                      xlabel='Truth (LAND COVER)',
                      ylabel='Predicted (GBM)',
                      title='Transposed Confusion matrix');


For most of the classes the model seems to perform well; for the rest, the training sample is probably too small to make a fair assessment. Additional problems arise from the unbalanced training set. The image below shows the frequency of the classes used for model training, and we see that the problematic cases are all under-represented classes: shrubland, water, wetland, and bareland.

Improving the reference map would also affect the end result, since some classes are mixed up with each other to some degree.

fig = plt.figure(figsize=(20, 5))

label_ids, label_counts = np.unique(labels_train, return_counts=True)

plt.bar(range(len(label_ids)), label_counts)
plt.xticks(range(len(label_ids)), [class_names[i] for i in label_ids], rotation=45, fontsize=20);
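One common mitigation for such imbalance is to weight training samples inversely to their class frequency when fitting the model. A minimal numpy sketch of the idea (the small `labels_train` array here is a toy stand-in, and passing `sample_weight` to `model.fit` is a suggestion, not a step from this notebook):

```python
import numpy as np

# Toy stand-in for the notebook's labels_train array
labels_train = np.array([1, 1, 1, 1, 2, 2, 3])

# Inverse-frequency ("balanced") weights: rare classes count more in training
ids, counts = np.unique(labels_train, return_counts=True)
class_weight = {i: len(labels_train) / (len(ids) * c) for i, c in zip(ids, counts)}
sample_weight = np.array([class_weight[lbl] for lbl in labels_train])

# Gradient-boosting APIs with an sklearn-style interface generally accept a
# per-sample weight, e.g. model.fit(X, y, sample_weight=sample_weight)
print(class_weight)
```

With this scheme the weights of all samples sum to the number of samples, so only the relative influence of the classes changes, not the overall scale of the loss.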

ROC curves and AUC metrics

Calculate precision and recall rates, draw ROC curves and calculate AUC.

class_labels = np.unique(np.hstack([labels_test, labels_train]))

scores_test = model.predict_proba(features_test)
labels_binarized = preprocessing.label_binarize(labels_test, classes=class_labels)

fpr = dict()
tpr = dict()
roc_auc = dict()

for idx, lbl in enumerate(class_labels):
    fpr[idx], tpr[idx], _ = metrics.roc_curve(labels_binarized[:, idx], scores_test[:, idx])
    roc_auc[idx] = metrics.auc(fpr[idx], tpr[idx])
/home/ubuntu/.pyenv/versions/3.7.6/envs/base/lib/python3.7/site-packages/sklearn/metrics/ UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
plt.figure(figsize=(20, 10))

for idx, lbl in enumerate(class_labels):
    if np.isnan(roc_auc[idx]):
        continue
    plt.plot(fpr[idx], tpr[idx], color=lulc_cmap.colors[lbl],
             lw=2, label=class_names[lbl] + ' (%0.5f)' % roc_auc[idx])

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 0.99])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=20)
plt.ylabel('True Positive Rate', fontsize=20)
plt.title('ROC Curve', fontsize=20)
plt.legend(loc="center right", prop={'size': 15})

Most important features

Let us now check which features are most important in the above classification. The LightGBM model already contains the information about feature importances, so we only need to query them.

# names of features
fnames = ['B2','B3','B4','B8','B11','B12','NDVI','NDWI','NDBI']

# get feature importances and reshape them to dates and features
feature_importances = model.feature_importances_.reshape((t, f))

fig = plt.figure(figsize=(15, 15))
ax = plt.gca()

# plot the importances
im = ax.imshow(feature_importances, aspect=0.25)
plt.xticks(range(len(fnames)), fnames, rotation=45, fontsize=20)
plt.yticks(range(t), [f'T{i}' for i in range(t)], fontsize=20)
plt.xlabel('Bands and band related features', fontsize=20)
plt.ylabel('Time frames', fontsize=20)

fig.subplots_adjust(wspace=0, hspace=0)

cb = fig.colorbar(im, ax=[ax], orientation='horizontal', pad=0.01, aspect=100)

Let’s plot the RGB for the most important date

# Draw the RGB image
path_out_sampled = './eopatches_sampled'
fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(20, 20))

time_id = np.where(feature_importances == np.max(feature_importances))[0][0]

for i in tqdm(range(len(patchIDs))):
    eopatch = EOPatch.load(f'{path_out_sampled}/eopatch_{i}', lazy_loading=True)
    ax = axs[i//5][i%5]
    ax.imshow(np.clip(eopatch.data['FEATURES'][time_id][..., [2, 1, 0]] * 2.5, 0, 1))
    del eopatch

fig.subplots_adjust(wspace=0, hspace=0)


8. Visualization of the results

The model has been validated; what remains is to make the prediction over the whole AOI.

Here we define a workflow to run the model prediction on the existing EOPatches. The EOTask accepts the feature and the names for the labels and scores. The latter is optional.

Define EOTasks

class PredictPatch(EOTask):
    """
    Task to make model predictions on a patch. Provide the model and the feature,
    and the output names of labels and scores (optional).
    """
    def __init__(self, model, features_feature, predicted_labels_name, predicted_scores_name=None):
        self.model = model
        self.features_feature = features_feature
        self.predicted_labels_name = predicted_labels_name
        self.predicted_scores_name = predicted_scores_name

    def execute(self, eopatch):
        ftrs = eopatch[self.features_feature[0]][self.features_feature[1]]

        t, w, h, f = ftrs.shape
        ftrs = np.moveaxis(ftrs, 0, 2).reshape(w * h, t * f)

        plabels = self.model.predict(ftrs)
        plabels = plabels.reshape(w, h)
        plabels = plabels[..., np.newaxis]
        eopatch.add_feature(FeatureType.MASK_TIMELESS, self.predicted_labels_name, plabels)

        if self.predicted_scores_name:
            pscores = self.model.predict_proba(ftrs)
            _, d = pscores.shape
            pscores = pscores.reshape(w, h, d)
            eopatch.add_feature(FeatureType.DATA_TIMELESS, self.predicted_scores_name, pscores)

        return eopatch
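The reshaping in `execute` flattens each pixel's full time series into a single feature vector, matching the layout the model was trained on. A small numpy sketch of that shape bookkeeping (the shapes here are made up):

```python
import numpy as np

# A toy (time, width, height, features) stack like eopatch.data['FEATURES']
t, w, h, f = 3, 4, 5, 2
ftrs = np.arange(t * w * h * f).reshape(t, w, h, f)

# Move the time axis next to the feature axis, then flatten: one row per
# pixel, holding all t * f values of that pixel's time series
flat = np.moveaxis(ftrs, 0, 2).reshape(w * h, t * f)
print(flat.shape)  # (20, 6)
```

After prediction, the inverse `reshape(w, h)` restores the spatial layout of the labels.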

Define Tasks and the Workflow

load = LoadTask(path_out_sampled)

predict = PredictPatch(model, (FeatureType.DATA, 'FEATURES'), 'LBL_GBM', 'SCR_GBM')

save = SaveTask(str(path_out_sampled), overwrite_permission=OverwritePermission.OVERWRITE_PATCH)

export_tiff = ExportToTiff((FeatureType.MASK_TIMELESS, 'LBL_GBM'))
tiff_location = './predicted_tiff'
if not os.path.isdir(tiff_location):
    os.makedirs(tiff_location)

workflow = LinearWorkflow(
    load,
    predict,
    export_tiff,
    save
)

Run the prediction and export to GeoTIFF images

Here we use the EOExecutor to run the workflow in parallel.

# create a list of execution arguments for each patch
execution_args = []
for i in range(len(patchIDs)):
    execution_args.append(
        {
            load: {'eopatch_folder': f'eopatch_{i}'},
            export_tiff: {'filename': f'{tiff_location}/prediction_eopatch_{i}.tiff'},
            save: {'eopatch_folder': f'eopatch_{i}'}
        }
    )

# run the executor
executor = EOExecutor(workflow, execution_args)
executor.run(multiprocess=False)

# merge with (with compression) using bash command magic
# gdal has to be installed on your computer!
!gdal_merge.py -o predicted_tiff/merged_prediction.tiff -co compress=LZW predicted_tiff/prediction_eopatch_*
/bin/sh: 1: gdal_merge.py: not found
CPU times: user 7.05 ms, sys: 68.5 ms, total: 75.5 ms
Wall time: 254 ms
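If GDAL is not installed (as in the output above), the per-patch predictions can also be mosaicked in Python. The sketch below stitches plain arrays with numpy, assuming the patches are equally sized and laid out row-major on a regular grid; reading the actual GeoTIFFs (e.g. with rasterio) and writing georeferenced output are left out:

```python
import numpy as np

def mosaic(patches, ncols):
    """Stitch equally sized (h, w) arrays laid out row-major into one array."""
    rows = [np.hstack(patches[i:i + ncols]) for i in range(0, len(patches), ncols)]
    return np.vstack(rows)

# Four 2x2 dummy "prediction" patches arranged on a 2x2 grid
patches = [np.full((2, 2), v) for v in range(4)]
merged = mosaic(patches, ncols=2)
print(merged.shape)  # (4, 4)
```

For the real 5x5 AOI this would be `mosaic(label_arrays, ncols=5)`; note that, unlike `gdal_merge.py`, this discards the geo-referencing.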

Visualise the prediction

path_out_sampled = './eopatches_sampled'

fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(20, 25))

for i in tqdm(range(len(patchIDs))):
    eopatch = EOPatch.load(f'{path_out_sampled}/eopatch_{i}', lazy_loading=True)
    ax = axs[i//5][i%5]
    im = ax.imshow(eopatch.mask_timeless['LBL_GBM'].squeeze(), cmap=lulc_cmap, norm=lulc_norm)
    del eopatch

fig.subplots_adjust(wspace=0, hspace=0)

cb = fig.colorbar(im, ax=axs.ravel().tolist(), orientation='horizontal', pad=0.01, aspect=100)
cb.set_ticks([entry.id for entry in LULC])
cb.ax.set_xticklabels([entry.name for entry in LULC], rotation=45, fontsize=15)


Visual inspection of patches

Here is just a simple piece of code that allows a closer inspection of the predicted labels.

Random subsets of patches are chosen, where prediction and ground truth are compared. For visual aid the mask of differences and the true color image are also provided.

In the majority of cases, the differences lie on the borders between different structures.
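That impression can be quantified by checking which fraction of the disagreeing pixels sit next to a class boundary in the ground truth. A numpy sketch on hypothetical small label arrays (not the notebook's data):

```python
import numpy as np

def boundary_mask(labels):
    """True for pixels that differ from a horizontal or vertical neighbour."""
    b = np.zeros(labels.shape, dtype=bool)
    horiz = labels[:, :-1] != labels[:, 1:]
    vert = labels[:-1, :] != labels[1:, :]
    b[:, :-1] |= horiz
    b[:, 1:] |= horiz
    b[:-1, :] |= vert
    b[1:, :] |= vert
    return b

truth = np.array([[1, 1, 2],
                  [1, 1, 2],
                  [3, 3, 3]])
pred = np.array([[1, 2, 2],
                 [1, 1, 2],
                 [3, 1, 3]])

diff = truth != pred
on_border = boundary_mask(truth)
# Fraction of misclassified pixels that touch a ground-truth class boundary
frac = (diff & on_border).sum() / max(diff.sum(), 1)
print(frac)
```

A fraction close to 1 would support the observation that errors concentrate on object borders, where mixed pixels and reference-map inaccuracies are most likely.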

# Draw the Reference map
fig = plt.figure(figsize=(20, 20))

idx = np.random.choice(range(len(patchIDs)))
inspect_size = 100

eopatch = EOPatch.load(f'{path_out_sampled}/eopatch_{idx}', lazy_loading=True)

w, h = eopatch.mask_timeless['LULC'].squeeze().shape

w_min = np.random.choice(range(w - inspect_size))
h_min = np.random.choice(range(h - inspect_size))

ax = plt.subplot(2, 2, 1)
plt.imshow(eopatch.mask_timeless['LULC'].squeeze()[w_min: w_min + inspect_size, h_min : h_min + inspect_size],
           cmap=lulc_cmap, norm=lulc_norm)
plt.title('Ground Truth', fontsize=20)

ax = plt.subplot(2, 2, 2)
plt.imshow(eopatch.mask_timeless['LBL_GBM'].squeeze()[w_min: w_min + inspect_size, h_min: h_min + inspect_size],
           cmap=lulc_cmap, norm=lulc_norm)
plt.title('Prediction', fontsize=20)

ax = plt.subplot(2, 2, 3)
mask = eopatch.mask_timeless['LBL_GBM'].squeeze() != eopatch.mask_timeless['LULC'].squeeze()
plt.imshow(mask[w_min: w_min + inspect_size, h_min: h_min + inspect_size], cmap='gray')
plt.title('Difference', fontsize=20)

ax = plt.subplot(2, 2, 4)
image = np.clip(eopatch.data['FEATURES'][8][..., [2, 1, 0]] * 3.5, 0, 1)
plt.imshow(image[w_min: w_min + inspect_size, h_min: h_min + inspect_size])
plt.title('True Color', fontsize=20)

fig.subplots_adjust(wspace=0.1, hspace=0.1)