5. API Reference

5.1. Datasets and Datastores

5.1.1. Exploring Data

esa_climate_toolbox.core.find_data_store(ds_id: str) Tuple[str | None, DataStore | None]

Finds the data store that contains the data source with the given ds_id. An exception is raised if the ds_id is found in more than one data store.

Parameters:

ds_id – A data source identifier.

Returns:

A tuple consisting of the name of the data store and the data store itself.

esa_climate_toolbox.core.get_store(store_id: str)

Returns the data store of the given name.

Parameters:

store_id – The name of the store.

Returns:

A data store

esa_climate_toolbox.core.get_search_params(store_id: str, data_type: str) Dict

Returns potential search parameters that can be used to search for datasets in a data store.

Parameters:
  • store_id – The id of the store which shall be searched for data

  • data_type – An optional data type to specify the type of data to be searched

Returns:

A dictionary containing search parameters

esa_climate_toolbox.core.list_datasets(store_id: str = None, data_type: str | None | type | DataType = None, include_attrs: Container[str] = None) List[str] | List[Tuple[str, Dict[str, Any]]]

Returns the names of datasets of a given store.

Parameters:
  • store_id – The name of the data store

  • data_type – A datatype that may be provided to restrict the search, e.g., ‘dataset’ or ‘geodataframe’

  • include_attrs – An optional list of names of attributes that shall be returned together with the dataset names.

Returns:

Either a list of the dataset names within the store, or a list of tuples, each consisting of a name and a dictionary with additional information.

esa_climate_toolbox.core.list_ecvs() List[str]

Returns a list of names of essential climate variables served by the ESA Climate Toolbox.

Returns:

A list of names of essential climate variables.

esa_climate_toolbox.core.list_ecv_datasets(ecv: str, data_type: str | None | type | DataType = None, include_attrs: Container[str] = None) List[str] | List[Tuple[str, Dict[str, Any]]]

Returns the names of datasets for a given essential climate variable.

Parameters:
  • ecv – The name of the essential climate variable

  • data_type – A datatype that may be provided to restrict the search, e.g., ‘dataset’ or ‘geodataframe’

  • include_attrs – An optional list of names of attributes that shall be returned together with the dataset names.

Returns:

Either a list of dataset names for the given ecv, or a list of tuples, each consisting of a name and a dictionary with additional information.

esa_climate_toolbox.core.list_stores() List[str]

Lists the names of the data stores which are registered in the ESA Climate Toolbox.

Returns:

A list of the names of the registered data stores.

esa_climate_toolbox.core.get_op_meta_info(op_name: str, op_registry: OpRegistry = OP_REGISTRY) Dict

Returns meta information about an operation.

Parameters:
  • op_name – The name of the operation for which meta information shall be provided.

  • op_registry – An optional OpRegistry, in case the default one should not be used.

Returns:

A dictionary representation of an operator’s meta info, providing information about input parameters and the expected output.

esa_climate_toolbox.core.search(store_id: str, data_type: str = None, **search_params) List[Dict]

Searches in a data store for data that meet the given search criteria.

Parameters:
  • store_id – The id of the store which shall be searched for data

  • data_type – An optional data type to specify the type of data to be searched

  • search_params – Store-specific additional search parameters

Returns:

A list of dictionaries providing detailed information about the data that meet the specified criteria.

5.1.2. Managing Data

esa_climate_toolbox.core.add_local_store(root: str, store_id: str = None, max_depth: int = 1, read_only: bool = False, includes: str = None, excludes: str = None, title: str = None, description: str = None, persist: bool = True) str

Registers a new data store in the ESA Climate Toolbox to access locally stored data.

Parameters:
  • root – The path to the data.

  • store_id – The name the store should have. There must not already be a store of the same name.

  • max_depth – The maximum level of sub-directories that will be browsed for data. Default is 1, i.e., only the data located in the root path will be considered.

  • read_only – Whether the store is read-only. Default is false.

  • includes – An optional pattern specifying which data shall be served by the store (e.g., aerosol*.nc)

  • excludes – An optional pattern specifying which data shall not be served by the store (e.g., aerosol*.zarr)

  • title – An optional title for the data store

  • description – An optional description of the data store

  • persist – Whether the data store shall be registered permanently, otherwise it will only be for this session. Default is True.

Returns:

The id of the newly created store.

esa_climate_toolbox.core.add_store(store_type: str, store_params: Mapping[str, Any] = None, store_id: str = None, title: str = None, description: str = None, user_data: Any = None, persist: bool = True) str

Registers a new data store in the ESA Climate Toolbox. In contrast to add_local_store, this function also allows registering non-local data stores.

Parameters:
  • store_type – The type of data store to create, e.g., ‘s3’.

  • store_params – A mapping containing store-specific parameters which are required to initiate the store.

  • store_id – The name the store should have. There must not already be a store of the same name.

  • title – An optional title for the data store

  • description – An optional description of the data store

  • user_data – Any additional user data

  • persist – Whether the data store shall be registered permanently, otherwise it will only be for this session. Default is True.

Returns:

The id of the newly created store.

esa_climate_toolbox.core.remove_store(store_id: str, persist: bool = True)

Removes a store from the internal store registry. No actual data will be deleted.

Parameters:
  • store_id – The name of the store to be removed

  • persist – Whether the data store shall be unregistered permanently, otherwise it will only be for this session. Default is True.


5.1.3. Reading and Writing Data

esa_climate_toolbox.core.get_output_store_id() str | None

Returns the name of the store that by default will be used for writing.

Returns:

The id of the default output store.

esa_climate_toolbox.core.get_supported_formats(data: Any, store_id: str) List[str]

Returns the list of formats to which the store at the given store_id may write the given data.

Parameters:
  • data – The data which shall be written

  • store_id – The id of the store to which the data shall be written

Returns:

A list of supported output formats

esa_climate_toolbox.core.open_data(dataset_id: str, time_range: Tuple[str, str] | Tuple[datetime, datetime] | Tuple[date, date] | str = None, region: Polygon | List[Tuple[float, float]] | str | Tuple[float, float, float, float] = None, var_names: List[str] | str = None, data_store_id: str = None, monitor: Monitor = Monitor.NONE) Tuple[Any, str]

Open a dataset from a data store.

Parameters:
  • dataset_id – The identifier of the dataset. Must not be empty.

  • time_range – An optional time constraint comprising start and end date. If given, it must be a TimeRangeLike.

  • region – An optional region constraint. If given, it must be a PolygonLike.

  • var_names – Optional names of variables to be included. If given, it must be a VarNamesLike.

  • data_store_id – Optional data store identifier. If given, ds_id will only be looked up from the specified data store.

  • monitor – A progress monitor

Returns:

A tuple consisting of a new dataset instance and its id

esa_climate_toolbox.core.set_output_store(store_id: str)

Specifies which store shall be the standard output store. This value is not persisted and must be set every session.

Parameters:

store_id – The name of the store that shall be the output store.

esa_climate_toolbox.core.write_data(data: Any, data_id: str = None, store_id: str = None, format_id: str = None, replace: bool = False, monitor: Monitor = Monitor.NONE) str

Writes the given data to a data store.

Parameters:
  • data – The data which shall be written

  • data_id – A data id under which the data shall be written to the store. If not given, a data id will be created.

  • store_id – The id of the store to which the data shall be written. If none is given, the data is written to the standard output store.

  • format_id – A format that shall be used to write the data. If none is given, the data will be written in the default format for the data type, e.g., ‘zarr’ for datasets.

  • replace – Whether a dataset with the same id in the store shall be replaced. If False, an exception will be raised. Default is False.

  • monitor – A monitor to measure the writing process

Returns:

The data id under which the data can be accessed from the store.

5.2. Operations

5.2.1. Aggregation

esa_climate_toolbox.ops.climatology(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Create a ‘mean over years’ dataset by averaging the values of the given input dataset over all years. The output is a climatological dataset with the same resolution as the input dataset. E.g. a daily input dataset will create a daily climatology consisting of 365 days, a monthly input dataset will create a monthly climatology, etc.

Seasonal input datasets must have matching seasons over all years denoted by the same date each year. E.g., first date of each quarter. The output dataset will then be a seasonal climatology where each season is denoted with the same date as in the input dataset.

For further information on climatological datasets, see http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#climatological-statistics

Parameters:
  • ds – A dataset to average

  • var – If given, only these variables will be preserved in the resulting dataset

  • monitor – A progress monitor

Returns:

A climatological long term average dataset
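The idea of a climatology can be sketched with pandas on a hypothetical monthly series (the op itself performs the analogous computation on xarray datasets, preserving the input's temporal resolution):

```python
import pandas as pd

# A monthly climatology is the mean over years for each calendar month.
# 24 months of toy data spanning two years:
idx = pd.date_range("2000-01-01", "2001-12-01", freq="MS")
values = pd.Series(range(24), index=idx, dtype=float)
clim = values.groupby(values.index.month).mean()  # 12 values, one per month
# e.g. January: mean of the values for 2000-01 and 2001-01
```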

esa_climate_toolbox.ops.reduce(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Reduce the given variables of the given dataset along the given dimensions. If no variables are given, all variables of the dataset will be reduced. If no dimensions are given, all dimensions will be reduced. If no variables have been given explicitly, it can be set that only variables featuring numeric values should be reduced.

Parameters:
  • ds – Dataset to reduce

  • var – Variables in the dataset to reduce

  • dim – Dataset dimensions along which to reduce

  • method – reduction method

  • monitor – A progress monitor

esa_climate_toolbox.ops.temporal_aggregation(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Perform aggregation of the dataset according to the given aggregation method and time period.

Note that the operation does not perform weighting. Depending on the combination of input and output resolutions, as well as aggregation method, the resulting dataset might yield unexpected results.

The possible values of period are the offset aliases supported by the Pandas package: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Some examples for period values:

  • ‘QS-DEC’ will result in a dataset aggregated to DJF, MAM, JJA, SON seasons, each denoted by the first date of the season.

  • ‘QS-JUN’ produces an output dataset on a quarterly resolution where each year ends on the 1st of June and each quarter is denoted by its first date.

  • ‘8MS’ produces an output dataset on an eight-month resolution where each period is denoted by the first date. Note that such periods will not be consistent over years.

  • ‘8D’ produces a dataset on an eight day resolution.

Parameters:
  • ds – Dataset to aggregate

  • method – Aggregation method

  • period – Aggregation time period

Returns:

Aggregated dataset
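The effect of the ‘QS-DEC’ offset alias can be illustrated directly with pandas, using a hypothetical daily series for a single year:

```python
import pandas as pd

# 'QS-DEC' groups data into DJF/MAM/JJA/SON seasons, each labelled by
# the first date of the season.
s = pd.Series(1.0, index=pd.date_range("2010-01-01", "2010-12-31", freq="D"))
seasonal = s.resample("QS-DEC").mean()
# The first bin is labelled 2009-12-01: January and February 2010 belong
# to the DJF season that began in December 2009.
```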

5.2.2. Anomalies

esa_climate_toolbox.ops.anomaly_external(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Calculate anomaly with external reference data, for example, a climatology. The given reference dataset is expected to consist of 12 time slices, one for each month.

The returned dataset will contain the variable names found in both - the reference and the given dataset. Names found in the given dataset, but not in the reference, will be dropped from the resulting dataset. The calculated anomaly will be against the corresponding month of the reference data. E.g. January against January, etc.

In case spatial extents differ between the reference and the given dataset, the anomaly will be calculated on the intersection.

Parameters:
  • ds – The dataset to calculate anomalies from

  • file – Path to reference data file

  • transform – Apply the given transformation before calculating the anomaly. For supported operations see help on ‘ds_arithmetics’ operation.

  • monitor – a progress monitor.

Returns:

The anomaly dataset
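The month-matched subtraction can be sketched with pandas. Note that here the 12-slice reference is derived from the data itself purely for illustration; in the op it is read from the external reference file:

```python
import pandas as pd

# Each value minus the reference value of its calendar month
# (January against January, etc.).
idx = pd.date_range("2000-01-01", periods=24, freq="MS")
data = pd.Series(range(24), index=idx, dtype=float)
reference = data.groupby(data.index.month).mean()      # 12 monthly means
anomaly = data - reference.loc[data.index.month].to_numpy()
```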

esa_climate_toolbox.ops.anomaly_internal(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Calculate anomaly using as reference data the mean of an optional region and time slice from the given dataset. If no time slice/spatial region is given, the operation will calculate anomaly using the mean of the whole dataset as the reference.

This is done for each data array in the dataset.

Parameters:
  • ds – The dataset to calculate anomalies from

  • time_range – Time range to use for reference data

  • region – Spatial region to use for reference data

  • monitor – a progress monitor.

Returns:

The anomaly dataset

5.2.3. Arithmetics

esa_climate_toolbox.ops.arithmetics(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Do arithmetic operations on the given dataset by providing a list of arithmetic operations and the corresponding constant. The operations will be applied to the dataset in the order in which they appear in the list. For example: ‘log,+5,-2,/3,*2’

Currently supported arithmetic operations: log,log10,log2,log1p,exp,+,-,/,*

where:

  • log – natural logarithm

  • log10 – base 10 logarithm

  • log2 – base 2 logarithm

  • log1p – log(1+x)

  • exp – the exponential

The operations will be applied element-wise to all arrays of the dataset.

Parameters:
  • ds – The dataset to which to apply arithmetic operations

  • ops – A comma separated list of arithmetic operations to apply

  • monitor – a progress monitor.

Returns:

The dataset with given arithmetic operations applied
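How such a comma-separated specification is applied in order can be sketched with numpy. The parser below (apply_ops) is illustrative only, not the toolbox's implementation:

```python
import numpy as np

def apply_ops(arr, spec):
    """Apply a spec like 'log,+5,-2,*2' to an array, left to right."""
    funcs = {"log": np.log, "log10": np.log10, "log2": np.log2,
             "log1p": np.log1p, "exp": np.exp}
    for token in spec.split(","):
        token = token.strip()
        if token in funcs:
            arr = funcs[token](arr)       # element-wise function
        else:
            op, value = token[0], float(token[1:])
            arr = {"+": arr + value, "-": arr - value,
                   "/": arr / value, "*": arr * value}[op]
    return arr

result = apply_ops(np.array([np.e]), "log,+5,-2,*2")
# log(e) = 1  →  +5 = 6  →  -2 = 4  →  *2 = 8
```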

esa_climate_toolbox.ops.diff(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Calculate the difference of two datasets (ds - ds2). This is done by matching variable names in the two datasets against each other and taking the difference of matching variables.

If lat/lon/time extents differ between the datasets, the default behavior is to take the intersection of the datasets and run subtraction on that. However, broadcasting is possible. E.g. ds(lat/lon/time) - ds(lat/lon) is valid. In this case the subtrahend will be stretched to the size of ds(lat/lon/time) so that it can be subtracted. This also works if the subtrahend is a single time slice of arbitrary temporal position. In this case, the time dimension will be squeezed out leaving a lat/lon dataset.

Parameters:
  • ds – The minuend dataset

  • ds2 – The subtrahend dataset

  • monitor – a progress monitor.

Returns:

The difference dataset
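The broadcasting behaviour described above can be sketched with plain numpy: subtracting a (lat, lon) field from a (time, lat, lon) cube stretches the subtrahend along the time dimension.

```python
import numpy as np

cube = np.ones((3, 2, 2))       # hypothetical (time, lat, lon) minuend
field = np.full((2, 2), 0.25)   # (lat, lon) subtrahend
d = cube - field                # broadcast: shape (3, 2, 2), all 0.75
```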

5.2.4. Coregistration

esa_climate_toolbox.ops.coregister(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Perform coregistration of two datasets by resampling the replica dataset onto the grid of the primary. If upsampling has to be performed, this is achieved using interpolation, if downsampling has to be performed, the pixels of the replica dataset are aggregated to form a coarser grid.

The returned dataset will contain the lat/lon intersection of the provided primary and replica datasets, resampled onto the grid of the primary.

This operation works on datasets whose spatial dimensions are defined on pixel-registered grids that are equidistant in lat/lon coordinates, i.e., data points define the middle of a pixel and pixels have the same size across the dataset.

This operation will resample all variables in a dataset, as the lat/lon grid is defined per dataset. It works only if all variables in the dataset have lat and lon as dimensions.

For an overview of downsampling/upsampling methods used in this operation, please see https://github.com/CAB-LAB/gridtools

Whether upsampling or downsampling has to be performed is determined automatically based on the relationship of the grids of the provided datasets.

Parameters:
  • ds_primary – The dataset whose grid is used for resampling

  • ds_replica – The dataset that will be resampled

  • method_us – Interpolation method to use for upsampling.

  • method_ds – Interpolation method to use for downsampling.

  • monitor – a progress monitor.

Returns:

The replica dataset resampled on the grid of the primary

5.2.5. Data Frame Operations

esa_climate_toolbox.ops.aggregate_statistics(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Aggregate columns into count, mean, median, sum, std, min, and max. Return a new (Geo)DataFrame with a single row containing all aggregated values. Specify whether the geometries of the GeoDataFrame are to be aggregated. All geometries are merged union-like.

The return data type will always be the same as the input data type.

Parameters:
  • df – The (Geo)DataFrame to be analysed

  • var_names – Variables to be aggregated (‘None’ uses all aggregatable columns)

  • aggregate_geometry – Aggregate (union like) the geometry and add it to the resulting GeoDataFrame

  • monitor – Monitor for progress bar

Returns:

Either a DataFrame or a GeoDataFrame, matching the type of the input.
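The aggregation into a single row of statistics can be sketched with pandas on a hypothetical column (the op additionally handles GeoDataFrame geometries):

```python
import pandas as pd

df = pd.DataFrame({"ndvi": [0.2, 0.4, 0.6]})  # hypothetical variable
stats = df["ndvi"].agg(["count", "mean", "median", "sum", "std", "min", "max"])
```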

esa_climate_toolbox.ops.data_frame_max(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Select the first record of a data frame for which the given variable value is maximal.

Parameters:
  • df – The data frame or dataset.

  • var – The variable.

Returns:

A new, one-record data frame.
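The "first record with the maximal value" behaviour can be sketched with pandas, since idxmax returns the first occurrence on ties:

```python
import pandas as pd

df = pd.DataFrame({"station": ["a", "b", "c"],      # hypothetical frame
                   "temp": [11.2, 17.9, 17.9]})     # tie between b and c
record = df.loc[[df["temp"].idxmax()]]              # one-row frame: station 'b'
```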

esa_climate_toolbox.ops.data_frame_min(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Select the first record of a data frame for which the given variable value is minimal.

Parameters:
  • df – The data frame or dataset.

  • var – The variable.

Returns:

A new, one-record data frame.

esa_climate_toolbox.ops.data_frame_subset(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Create a GeoDataFrame subset from given variables (data frame columns) and/or region.

Parameters:
  • gdf – A GeoDataFrame.

  • region_op – The geometric operation to be performed if region is given.

  • region – A region polygon used to filter rows.

  • var_names – The variables (columns) to select.

Returns:

A GeoDataFrame subset.

esa_climate_toolbox.ops.find_closest(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Find the max_results records closest to given location in the given GeoDataFrame gdf. Return a new GeoDataFrame containing the closest records.

If dist_col_name is given, store the actual distances in this column.

Distances are great-circle distances measured in degrees from a representative center of the given location geometry to the representative centres of each geometry in the gdf.

Parameters:
  • gdf – The GeoDataFrame.

  • location – A location given as arbitrary geometry.

  • max_results – Maximum number of results.

  • max_dist – Ignore records whose distance is greater than this value in degrees.

  • dist_col_name – Optional name of a new column that will store the actual distances.

  • monitor – A progress monitor.

Returns:

A new GeoDataFrame containing the closest records.
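The distance metric used here is the great-circle distance expressed in degrees. A minimal sketch of that metric for two lon/lat points (the helper name gc_distance_deg is hypothetical; the toolbox computes this from representative centres of the geometries involved):

```python
import math

def gc_distance_deg(lon1, lat1, lon2, lat2):
    """Central angle, in degrees, between two lon/lat points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    central = math.acos(
        math.sin(lat1) * math.sin(lat2)
        + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
    return math.degrees(central)

d = gc_distance_deg(0.0, 0.0, 90.0, 0.0)  # a quarter of the equator
```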

esa_climate_toolbox.ops.query(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Select records from the given data frame where the given conditional query expression evaluates to “True”.

If the data frame df contains a geometry column (a GeoDataFrame object), then the query expression query_expr can also contain geometric relationship tests, for example the expression "population > 100000 and @within('-10, 34, 20, 60')" could be used on a data frame with the population and a geometry column to query for larger cities in West-Europe.

The geometric relationship tests are:

  • @almost_equals(geom) – does a feature’s geometry almost equal the given geom;

  • @contains(geom) – does a feature’s geometry contain the given geom;

  • @crosses(geom) – does a feature’s geometry cross the given geom;

  • @disjoint(geom) – does a feature’s geometry not at all intersect the given geom;

  • @intersects(geom) – does a feature’s geometry intersect with the given geom;

  • @touches(geom) – does a feature’s geometry have a point in common with the given geom while their interiors do not intersect;

  • @within(geom) – is a feature’s geometry contained within the given geom.

The geom argument may be a point "<lon>, <lat>" text string, a bounding box "<lon1>, <lat1>, <lon2>, <lat2>" text, or any valid geometry WKT.

Parameters:
  • df – The data frame or dataset.

  • query_expr – The conditional query expression.

Returns:

A new data frame.
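The non-geometric part of query_expr behaves like a conditional expression in pandas' own DataFrame.query, sketched here on a hypothetical city table (the @-prefixed geometric tests additionally require a GeoDataFrame and are evaluated by the toolbox):

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"],
                   "population": [50_000, 250_000, 1_200_000]})
large = df.query("population > 100000")  # keeps rows B and C
```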

5.2.6. Resampling

esa_climate_toolbox.ops.resample(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Resample a dataset to the provided x- and y-resolution. The resolution must be given in the units of the CRS. It can be set which method to use to upsample integer or float variables (in case the new resolution is finer than the old one) or to downsample them (in case the new resolution is coarser).

Parameters:
  • ds – The input dataset.

  • x_res – The resolution in x-direction.

  • y_res – The resolution in y-direction.

  • upsampling_float – The upsampling method to be used for float values. This value is only used when the new resolution is finer than the previous one. Allowed values are ‘nearest_neighbor’, ‘bilinear’, ‘2nd-order spline’, ‘cubic’, ‘4th-order spline’, and ‘5th-order spline’. The default is ‘bilinear’.

  • upsampling_int – The upsampling method to be used for integer and boolean values. This value is only used when the new resolution is finer than the previous one. Allowed values are ‘nearest_neighbor’, ‘bilinear’, ‘2nd-order spline’, ‘cubic’, ‘4th-order spline’, and ‘5th-order spline’. The default is ‘nearest_neighbor’.

  • downsampling_float – The downsampling method to be used for float values. This value is only used when the new resolution is coarser than the previous one. Allowed values are ‘nearest_neighbor’, ‘mean’, ‘min’, and ‘max’. The default is ‘mean’.

  • downsampling_int – The downsampling method to be used for integer and boolean values. This value is only used when the new resolution is coarser than the previous one. Allowed values are ‘nearest_neighbor’, ‘mean’, ‘min’, and ‘max’. The default is ‘nearest_neighbor’.

Returns:

A new dataset resampled to the new resolutions.

esa_climate_toolbox.ops.resample_2d(src, w, h, ds_method=54, us_method=11, fill_value=None, mode_rank=1, out=None)

Resample a 2-D grid to a new resolution.

Parameters:
  • src – 2-D ndarray

  • w – int, new grid width

  • h – int, new grid height

  • ds_method – one of the DS_ constants, optional; grid cell aggregation method for a possible downsampling

  • us_method – one of the US_ constants, optional; grid cell interpolation method for a possible upsampling

  • fill_value – scalar, optional; if None, it is taken from src if it is a masked array, otherwise from out if it is a masked array, otherwise numpy’s default value is used.

  • mode_rank – scalar, optional; the rank of the frequency determined by the ds_method DS_MODE. One (the default) means most frequent value, two means second most frequent value, and so forth.

  • out – 2-D ndarray, optional; alternate output array in which to place the result. The default is None; if provided, it must have the same shape as the expected output.

Returns:

A resampled version of the src array.

esa_climate_toolbox.ops.downsample_2d(src, w, h, method=54, fill_value=None, mode_rank=1, out=None)

Downsample a 2-D grid to a lower resolution by aggregating original grid cells.

Parameters:
  • src – 2-D ndarray

  • w – int, new grid width, which must be less than or equal to src.shape[-1]

  • h – int, new grid height, which must be less than or equal to src.shape[-2]

  • method – one of the DS_ constants, optional; grid cell aggregation method

  • fill_value – scalar, optional; if None, it is taken from src if it is a masked array, otherwise from out if it is a masked array, otherwise numpy’s default value is used.

  • mode_rank – scalar, optional; the rank of the frequency determined by the method DS_MODE. One (the default) means most frequent value, two means second most frequent value, and so forth.

  • out – 2-D ndarray, optional; alternate output array in which to place the result. The default is None; if provided, it must have the same shape as the expected output.

Returns:

A downsampled version of the src array.

esa_climate_toolbox.ops.upsample_2d(src, w, h, method=11, fill_value=None, out=None)

Upsample a 2-D grid to a higher resolution by interpolating original grid cells.

Parameters:
  • src – 2-D ndarray

  • w – int, new grid width, which must be greater than or equal to src.shape[-1]

  • h – int, new grid height, which must be greater than or equal to src.shape[-2]

  • method – one of the US_ constants, optional; grid cell interpolation method

  • fill_value – scalar, optional; if None, it is taken from src if it is a masked array, otherwise from out if it is a masked array, otherwise numpy’s default value is used.

  • out – 2-D ndarray, optional; alternate output array in which to place the result. The default is None; if provided, it must have the same shape as the expected output.

Returns:

An upsampled version of the src array.
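The difference between interpolation-based upsampling and aggregation-based downsampling can be sketched with plain numpy, here using nearest-neighbour repetition and block means (illustrative only; the toolbox functions support the full set of DS_/US_ methods):

```python
import numpy as np

src = np.array([[1, 2],
                [3, 4]])
# Nearest-neighbour upsampling: each cell repeated into a 2x2 block.
up = np.repeat(np.repeat(src, 2, axis=0), 2, axis=1)        # 2x2 -> 4x4
# Mean downsampling: each 2x2 block aggregated back to one cell.
down = up.reshape(2, 2, 2, 2).mean(axis=(1, 3))             # 4x4 -> 2x2
```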

5.2.7. Subsetting

esa_climate_toolbox.ops.subset_spatial(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Do a spatial subset of the dataset.

Parameters:
  • ds – Dataset to subset

  • region – Spatial region to subset

  • mask – Whether values that fall within the bounding box of the polygon but not within the polygon itself shall be masked with NaN.

  • monitor – A monitor to report the progress of the process

Returns:

Subset dataset

esa_climate_toolbox.ops.subset_temporal(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Do a temporal subset of the dataset.

Parameters:
  • ds – Dataset or dataframe to subset

  • time_range – Time range to select

Returns:

Subset dataset

esa_climate_toolbox.ops.subset_temporal_index(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Do a temporal subset based on time indices.

Parameters:
  • ds – Dataset or dataframe to subset

  • time_ind_min – Minimum time index to select

  • time_ind_max – Maximum time index to select

Returns:

Subset dataset

5.2.8. Timeseries

esa_climate_toolbox.ops.tseries_point(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Extract a time series from ds at the given lon, lat position, using the given interpolation method for each variable in var, a comma-separated list of variable names.

The operation returns a new timeseries dataset that contains the point timeseries for all requested variables, with the original variable meta-information preserved.

If a variable has more than three dimensions, the resulting timeseries variable will preserve all other dimensions except for lon/lat.

Parameters:
  • ds – The dataset from which to perform timeseries extraction.

  • point – Point to extract, e.g. (lon,lat)

  • var – Variable(s) for which to perform the timeseries selection if none is given, all variables in the dataset will be used.

  • method – Interpolation method to use.

Returns:

A timeseries dataset

esa_climate_toolbox.ops.tseries_mean(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Extract the spatial mean timeseries of the provided variables. The returned dataset contains all the information of the given dataset plus timeseries data for the provided variables, named following the convention ‘var_name1_ts_mean’. In addition, the standard deviation is computed.

If a data variable with more dimensions than time/lat/lon is provided, the data will be reduced by taking the mean of all data values at a single time position resulting in one dimensional timeseries data variable.

Parameters:
  • ds – The dataset from which to perform timeseries extraction.

  • var – Variables for which to perform timeseries extraction

  • mean_suffix – Mean suffix to use for resulting datasets

  • std_suffix – Std suffix to use for resulting datasets

  • monitor – a progress monitor.

Returns:

Dataset with timeseries variables
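The spatial reduction per time step can be sketched with numpy on a hypothetical (time, lat, lon) cube:

```python
import numpy as np

cube = np.arange(24, dtype=float).reshape(2, 3, 4)  # (time, lat, lon)
ts_mean = cube.mean(axis=(1, 2))   # one spatial mean per time step
ts_std = cube.std(axis=(1, 2))     # matching standard deviation
```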

5.2.9. Misc

esa_climate_toolbox.ops.detect_outliers(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Detect outliers in the given Dataset.

When mask=True the input dataset should not contain nan values, otherwise all existing nan values will be marked as ‘outliers’ in the mask data array added to the output dataset.

Parameters:
  • ds – The dataset or dataframe for which to do outlier detection

  • var – Variable or variables in the dataset for which to do outlier detection. Note that when multiple variables are selected, absolute threshold values might not make much sense. Wildcards can be used to select multiple variables matching a pattern.

  • threshold_low – Values less or equal to this will be removed/masked

  • threshold_high – Values greater or equal to this will be removed/masked

  • quantiles – If True, threshold values are treated as quantiles, otherwise as absolute values.

  • mask – If True, an ancillary variable containing flag values for outliers will be added to the dataset. Otherwise, outliers will be replaced with nan directly in the data variables.

  • monitor – A progress monitor.

Returns:

The dataset with outliers masked or replaced with nan
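The difference between quantile-based and absolute thresholds can be sketched with numpy on a hypothetical 1-D variable:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
# With quantiles=True, the thresholds are interpreted as quantiles
# of the data rather than absolute values.
low, high = np.quantile(values, [0.05, 0.95])
outlier_mask = (values <= low) | (values >= high)  # flags 1.0 and 100.0
```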

esa_climate_toolbox.ops.merge(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Merge up to four datasets to produce a new dataset with combined variables from each input dataset.

This is a wrapper for the xarray.merge() function.

For documentation refer to xarray documentation at http://xarray.pydata.org/en/stable/generated/xarray.Dataset.merge.html#xarray.Dataset.merge

The compat argument indicates how to compare variables of the same name for potential conflicts:

  • “broadcast_equals”: all values must be equal when variables are broadcast against each other to ensure common dimensions.

  • “equals”: all values and dimensions must be the same.

  • “identical”: all values, dimensions and attributes must be the same.

  • “no_conflicts”: only values which are not null in both datasets must be equal. The returned dataset then contains the combination of all non-null values.

Parameters:
  • ds_1 – The first input dataset.

  • ds_2 – The second input dataset.

  • ds_3 – An optional 3rd input dataset.

  • ds_4 – An optional 4th input dataset.

  • join – How to combine objects with different indexes.

  • compat – How to compare variables of the same name for potential conflicts.

Returns:

A new dataset with combined variables from each input dataset.

esa_climate_toolbox.ops.normalize(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Normalize the geo- and time-coding upon opening the given dataset w.r.t. to a common (CF-compatible) convention used within the ESA Climate Toolbox. This will maximize the compatibility of a dataset for usage with operations.

That is:

  • variables named “latitude” will be renamed to “lat”;

  • variables named “longitude” or “long” will be renamed to “lon”.

Then, for equi-rectangular grids:

  • 2D “lat” and “lon” variables will be removed;

  • two new 1D coordinate variables “lat” and “lon” will be generated from their original 2D forms.

Finally, it will be ensured that a “time” coordinate variable will be of type datetime.

Parameters:

ds – The dataset to normalize.

Returns:

The normalized dataset, or the original dataset, if it is already “normal”.

esa_climate_toolbox.ops.adjust_spatial_attrs(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Adjust the global spatial attributes of the dataset by doing some introspection of the dataset and adjusting the appropriate attributes accordingly.

In case the determined attributes do not exist in the dataset, these will be added.

For more information on suggested global attributes see Attribute Convention for Data Discovery

Parameters:
  • ds – Dataset to adjust

  • allow_point – Whether a dataset containing a single point is allowed

Returns:

Adjusted dataset
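The kind of introspection performed can be sketched with plain Python: derive ACDD-style bounding-box attributes from the coordinate values (the attribute names follow the Attribute Convention for Data Discovery; the lists below stand in for the dataset's lat/lon coordinates):

```python
def spatial_attrs(lats, lons):
    """Derive ACDD-style global bounding-box attributes from coordinate values."""
    return {
        "geospatial_lat_min": min(lats),
        "geospatial_lat_max": max(lats),
        "geospatial_lon_min": min(lons),
        "geospatial_lon_max": max(lons),
    }

attrs = spatial_attrs([-60.0, 0.0, 60.0], [-180.0, 0.0, 179.0])
```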

esa_climate_toolbox.ops.adjust_temporal_attrs(*args, monitor: Monitor = Monitor.NONE, **kwargs)

Adjust the global temporal attributes of the dataset by doing some introspection of the dataset and adjusting the appropriate attributes accordingly.

In case the determined attributes do not exist in the dataset, these will be added.

If the temporal attributes exist, but the dataset lacks a variable ‘time’, a new dimension ‘time’ of size one will be added and related coordinate variables ‘time’ and ‘time_bnds’ are added to the dataset. The dimension of all non-coordinate variables will be expanded by the new time dimension.

For more information on suggested global attributes see Attribute Convention for Data Discovery

Parameters:

ds – Dataset to adjust

Returns:

Adjusted dataset

5.3. Operation Registration API

class esa_climate_toolbox.core.Operation(wrapped_op: Callable, op_meta_info=None)

An Operation comprises a wrapped callable (e.g. function, constructor, lambda form) and additional meta-information about the wrapped operation itself and its inputs and outputs.

Parameters:
  • wrapped_op – some callable object that will be wrapped.

  • op_meta_info – operation meta information.

property op_meta_info: OpMetaInfo
Returns:

Meta-information about the operation, see esa_climate_toolbox.core.op.OpMetaInfo.

property wrapped_op: Callable
Returns:

The actual operation object which may be any callable.

class esa_climate_toolbox.core.OpMetaInfo(qualified_name: str, has_monitor: bool = False, header: dict = None, input_names: List[str] = None, inputs: Dict[str, Dict[str, Any]] = None, outputs: Dict[str, Dict[str, Any]] = None)

Represents meta-information about an operation:

  • qualified_name: an ideally unique, qualified operation name

  • header: dictionary of arbitrary operation attributes

  • input: ordered dictionary of named inputs, each mapping to a dictionary of arbitrary input attributes

  • output: ordered dictionary of named outputs, each mapping to a dictionary of arbitrary output attributes

Warning: OpMetaInfo objects should be considered immutable. However, the dictionaries mentioned above are returned “as-is”, mostly for performance reasons. Changing entries in these dictionaries directly may cause unwanted side-effects.

Parameters:
  • qualified_name – The operation’s qualified name.

  • has_monitor – Whether the operation supports a Monitor keyword argument named monitor.

  • header – Header information dictionary.

  • input_names – An ordered list of input names.

  • inputs – Input information dictionary.

  • outputs – Output information dictionary.

MONITOR_INPUT_NAME = 'monitor'

The constant 'monitor', which is the name of an operation input that will receive a Monitor object as value.

RETURN_OUTPUT_NAME = 'return'

The constant 'return', which is the name of a single, unnamed operation output.

property has_monitor: bool
Returns:

True if the operation supports a Monitor value as additional keyword argument named monitor.

property has_named_outputs: bool
Returns:

True if the output value of the operation is expected to be a dictionary-like mapping of output names to output values.

property header: Dict[str, Any]
Returns:

Operation header attributes.

property input_names: List[str]

The input names in the order they have been declared.

Returns:

List of input names.

property inputs: Dict[str, Dict[str, Any]]

Mapping from an input name to a dictionary of properties describing the input.

Returns:

Named inputs.

property outputs: Dict[str, Dict[str, Any]]

Mapping from an output name to a dictionary of properties describing the output.

Returns:

Named outputs.

property qualified_name: str
Returns:

Fully qualified name of the actual operation.

set_default_input_values(input_values: Dict)

For any input missing from input_values, set it to the value of its “default_value” property, if that property exists.

Parameters:

input_values – The dictionary of input values that will be modified.
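The behaviour can be sketched in a few lines (a simplified stand-in for the method, using the same structure as the inputs mapping of name to property dictionary):

```python
def set_default_input_values(inputs, input_values):
    """Fill missing entries of input_values from the inputs' "default_value" properties."""
    for name, props in inputs.items():
        if name not in input_values and "default_value" in props:
            input_values[name] = props["default_value"]
    return input_values

inputs = {"latitude": {"default_value": 52.3}, "name": {}}
values = set_default_input_values(inputs, {"name": "Berlin"})
```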

to_json_dict(data_type_to_json=None) Dict[str, Any]

Return a JSON-serializable dictionary representation of this object. E.g. values of the data_type property are converted from Python types to their string representation.

Returns:

A JSON-serializable dictionary

validate_input_values(input_values: ~typing.Dict, except_types=None, validation_exception_class=<class 'ValueError'>)

Validate given input_values against the operation’s input properties.

Parameters:
  • input_values – The dictionary of input values.

  • except_types – A set of types or None. If an input value’s type is in this set, it will not be validated against the various input properties, such as data_type, nullable, value_set, value_range.

  • validation_exception_class – The exception class to be used to raise exceptions if validation fails. Must derive from BaseException. Defaults to ValueError.

Raises:

validation_exception_class – If input_values are invalid w.r.t. the operation’s input properties.
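A simplified sketch of this kind of validation, checking only the value_set and value_range input properties and using ValueError as the default exception class (the real method also checks data_type and nullable):

```python
def validate_input_values(inputs, input_values, exception_class=ValueError):
    """Check each given value against its input's value_set / value_range properties."""
    for name, value in input_values.items():
        props = inputs.get(name, {})
        if "value_set" in props and value not in props["value_set"]:
            raise exception_class(f"{name}: {value!r} not in {props['value_set']}")
        if "value_range" in props:
            lo, hi = props["value_range"]
            if not lo <= value <= hi:
                raise exception_class(f"{name}: {value!r} outside [{lo}, {hi}]")

inputs = {"method": {"value_set": ["nearest", "linear"]}, "lat": {"value_range": (-90, 90)}}
validate_input_values(inputs, {"method": "linear", "lat": 52.3})  # passes silently
```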

validate_output_values(output_values: ~typing.Dict, validation_exception_class: type = <class 'ValueError'>)

Validate given output_values against the operation’s output properties.

Parameters:
  • output_values – The dictionary of output values.

  • validation_exception_class – The exception class to be used to raise exceptions if validation fails. Must derive from BaseException. Defaults to ValueError.

Raises:

validation_exception_class – If output_values are invalid w.r.t. the operation’s output properties.

esa_climate_toolbox.core.op(tags=UNDEFINED, version=UNDEFINED, res_pattern=UNDEFINED, deprecated=UNDEFINED, registry=OP_REGISTRY, **properties)

op is a decorator function that registers a Python function or class in the default operation registry or the one given by registry, if any. Any other keyword arguments in header are added to the operation’s meta-information header. Classes annotated by this decorator must have callable instances.

When a function is registered, an introspection is performed. During this process, the initial value of the meta-information header property description is derived from the function’s docstring.

If any output of this operation is to have its history information updated automatically, version information must be present in the operation header. It is therefore always a good idea to add it to all operations:

@op(version='X.x')
Parameters:
  • tags – An optional list of string tags.

  • version – An optional version string.

  • res_pattern – An optional pattern that will be used to generate the names for data resources that are used to hold a reference to the objects returned by the operation. Currently, the only pattern variable that is supported and that must be present is {index} which will be replaced by an integer number that is guaranteed to produce a unique resource name.

  • deprecated – An optional boolean or a string. If a string is used, it should explain why the operation has been deprecated and which new operation to use instead. If set to True, the operation’s doc-string should explain the deprecation.

  • registry – The operation registry.

  • properties – Other properties (keyword arguments) that will be added to the meta-information of operation.

esa_climate_toolbox.core.op_input(input_name: str, default_value=UNDEFINED, units=UNDEFINED, data_type=UNDEFINED, nullable=UNDEFINED, value_set_source=UNDEFINED, value_set=UNDEFINED, value_range=UNDEFINED, script_lang=UNDEFINED, deprecated=UNDEFINED, position=UNDEFINED, context=UNDEFINED, registry=OP_REGISTRY, **properties)

op_input is a decorator function that provides meta-information for an operation input identified by input_name. If the decorated function or class is not registered as an operation yet, it is added to the default operation registry or the one given by registry, if any.

When a function is registered, an introspection is performed. During this process, initial operation meta-information input properties are derived for each positional and keyword argument named input_name:

Derived property → Source:

  • position – The position of a positional argument, e.g. 2 for input z in def f(x, y, z, c=2).

  • default_value – The value of a keyword argument, e.g. 52.3 for input latitude from the argument definition latitude: float = 52.3.

  • data_type – The type annotation, e.g. float for input latitude from the argument definition latitude: float.

The derived properties listed above plus any of value_set, value_range, and any key-value pairs in properties are added to the input’s meta-information. A key-value pair in properties will always overwrite the derived properties listed above.

Parameters:
  • input_name – The name of an input.

  • default_value – A default value.

  • units – The geo-physical units of the input value.

  • data_type – The data type of the input values. If not given, the type of any given, non-None default_value is used.

  • nullable – If True, the value of the input may be None. If not given, it will be set to True if the default_value is None.

  • value_set_source – The name of an input, which can be used to generate a dynamic value set.

  • value_set – A sequence of the valid values. Note that all values in this sequence must be compatible with data_type.

  • value_range – A sequence specifying the possible range of valid values.

  • script_lang – The programming language for a parameter of data_type “str” that provides source code of a script, e.g. “python”.

  • deprecated – An optional boolean or a string. If a string is used, it should explain why the input has been deprecated and which new input to use instead. If set to True, the input’s doc-string should explain the deprecation.

  • position – The zero-based position of an input.

  • context – If True, the value of the operation input will be a dictionary representing the current execution context. If context is a string, the value of the operation input will be the result of evaluating the string as Python expression with the current execution context as local environment. This means, context may be an expression such as ‘value_cache’, ‘workspace.base_dir’, ‘step’, ‘step.id’.

  • properties – Other properties (keyword arguments) that will be added to the meta-information of the named output.

  • registry – Optional operation registry.
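The derivation of position, default_value and data_type from a function signature can be reproduced with the standard inspect module (a sketch of the introspection described above, not the toolbox’s actual code):

```python
import inspect

def derive_input_properties(func):
    """Derive per-argument position, default_value and data_type, as @op_input does."""
    props = {}
    for position, (name, param) in enumerate(inspect.signature(func).parameters.items()):
        entry = {"position": position}
        if param.default is not inspect.Parameter.empty:
            entry["default_value"] = param.default
        if param.annotation is not inspect.Parameter.empty:
            entry["data_type"] = param.annotation
        props[name] = entry
    return props

def f(x, y, z, latitude: float = 52.3):
    ...

props = derive_input_properties(f)
```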

esa_climate_toolbox.core.op_output(output_name: str, data_type=UNDEFINED, deprecated=UNDEFINED, registry=OP_REGISTRY, **properties)

op_output is a decorator function that provides meta-information for an operation output identified by output_name. If the decorated function or class is not registered as an operation yet, it is added to the default operation registry or the one given by registry, if any.

If your function does not return multiple named outputs, use the op_return() decorator function. Note that:

@op_return(...)
def my_func(...):
    ...

is equivalent to:

@op_output('return', ...)
def my_func(...):
    ...

To automatically add information about the ESA Climate Toolbox, its version, this operation and its inputs, to this output, set ‘add_history’ to True:

@op_output('name', add_history=True)

Note that the operation should have version information added to it when add_history is True:

@op(version='X.x')
Parameters:
  • output_name – The name of the output.

  • data_type – The data type of the output value.

  • deprecated – An optional boolean or a string. If a string is used, it should explain why the output has been deprecated and which new output to use instead. If set to True, the output’s doc-string should explain the deprecation.

  • properties – Other properties (keyword arguments) that will be added to the meta-information of the named output.

  • registry – Optional operation registry.

esa_climate_toolbox.core.op_return(data_type=UNDEFINED, registry=OP_REGISTRY, **properties)

op_return is a decorator function that provides meta-information for a single, anonymous operation return value (whose output name is "return"). If the decorated function or class is not registered as an operation yet, it is added to the default operation registry or the one given by registry, if any. Any other keywords arguments in properties are added to the output’s meta-information.

When a function is registered, an introspection is performed. During this process, initial operation meta-information output properties are derived from the function’s return type annotation, that is data_type will be, e.g., float if a function is annotated as def f(x, y) -> float: ....

The derived data_type property and any key-value pairs in properties are added to the output’s meta-information. A key-value pair in properties will always overwrite a derived data_type.

If your function returns multiple named outputs, use the op_output() decorator function. Note that:

@op_return(...)
def my_func(...):
    ...

is equivalent to:

@op_output('return', ...)
def my_func(...):
    ...

To automatically add information about the ESA Climate Toolbox, its version, this operation and its inputs, to this output, set ‘add_history’ to True:

@op_return(add_history=True)

Note that the operation should have version information added to it when add_history is True:

@op(version='X.x')
Parameters:
  • data_type – The data type of the return value.

  • properties – Other properties (keyword arguments) that will be added to the meta-information of the return value.

  • registry – The operation registry.

esa_climate_toolbox.core.new_expression_op(op_meta_info: OpMetaInfo, expression: str) Operation

Create an operation that wraps a Python expression.

Parameters:
  • op_meta_info – Meta-information about the resulting operation and the operation’s inputs and outputs.

  • expression – The Python expression. May refer to any name given in op_meta_info.input.

Returns:

The Python expression wrapped into an operation.
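The idea can be sketched without the toolbox machinery: build a callable that evaluates a Python expression with the operation’s inputs as its local names (a simplified stand-in; the real function returns a full Operation with the given meta-information attached):

```python
def new_expression_op(input_names, expression):
    """Return a callable that evaluates `expression` with the named inputs in scope."""
    def operation(**kwargs):
        missing = set(input_names) - set(kwargs)
        if missing:
            raise ValueError(f"missing inputs: {sorted(missing)}")
        # Evaluate with the inputs as local variables and no builtins.
        return eval(expression, {"__builtins__": {}}, dict(kwargs))
    return operation

add = new_expression_op(["x", "y"], "x + 2 * y")
result = add(x=1, y=3)  # → 7
```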

esa_climate_toolbox.core.new_subprocess_op(op_meta_info: OpMetaInfo, command_pattern: str, run_python: bool = False, cwd: str | None = None, env: Dict[str, str] = None, shell: bool = False, started: str | Callable = None, progress: str | Callable = None, done: str | Callable = None) Operation

Create an operation for a child program run in a new process.

Parameters:
  • op_meta_info – Meta-information about the resulting operation and the operation’s inputs and outputs.

  • command_pattern – A pattern that will be interpolated to obtain the actual command to be executed. May contain “{input_name}” fields which will be replaced by the actual input value converted to text. input_name must refer to a valid operation input name in op_meta_info.input or it must be the value of either the “write_to” or “read_from” property of another input’s property map.

  • run_python – If True, command_pattern refers to a Python script which will be executed with the Python interpreter that the Climate Toolbox uses.

  • cwd – Current working directory to run the command line in.

  • env – Environment variables passed to the shell that executes the command line.

  • shell – Whether to use the shell as the program to execute.

  • started – Either a callable that receives a text line from the executable’s stdout and returns a tuple (label, total_work), or a regex that must match in order to signal the start of progress monitoring. The regex must provide the group names “label” or “total_work” or both, e.g. “(?P<label>\w+)” or “(?P<total_work>\d+)”.

  • progress – Either a callable that receives a text line from the executable’s stdout and returns a tuple (work, msg), or a regex that must match in order to signal progress. The regex must provide the group names “work” or “msg” or both, e.g. “(?P<msg>\w+)” or “(?P<work>\d+)”.

  • done – Either a callable that receives a text line from the executable’s stdout and returns True or False, or a regex that must match in order to signal the end of progress monitoring.

Returns:

The executable wrapped into an operation.
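The command_pattern interpolation and the regex-based progress parsing can be sketched with stdlib tools (this mirrors the behaviour described above, not the toolbox’s implementation; the command and field names are made up for illustration):

```python
import re

# "{input_name}" fields are filled from the input values, converted to text:
command_pattern = "gdalwarp -tr {res} {res} {src} {dst}"
command = command_pattern.format(res=0.25, src="in.tif", dst="out.tif")

# A `progress` regex must expose "work" and/or "msg" as named groups:
progress_re = re.compile(r"step (?P<work>\d+): (?P<msg>\w+)")
match = progress_re.match("step 30: resampling")
work, msg = int(match.group("work")), match.group("msg")
```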

5.3.1. Managing Operations

esa_climate_toolbox.core.get_op(op_name: str, op_registry: OpRegistry = OP_REGISTRY) Operation

Returns an operation.

Parameters:
  • op_name – The name of the operation.

  • op_registry – An optional OpRegistry, in case the default one should not be used.

Returns:

An operation which may directly be called

esa_climate_toolbox.core.get_op_meta_info(op_name: str, op_registry: OpRegistry = OP_REGISTRY) Dict

Returns meta information about an operation.

Parameters:
  • op_name – The name of the operation for which meta information shall be provided.

  • op_registry – An optional OpRegistry, in case the default one should not be used.

Returns:

A dictionary representation of an operator’s meta info, providing information about input parameters and the expected output.

esa_climate_toolbox.core.list_operations(op_registry: OpRegistry = OP_REGISTRY, include_qualified_name: bool = False)

Lists the operations that are provided by the ESA Climate Toolbox.

Parameters:
  • op_registry – An optional OpRegistry, in case the default one should not be used.

  • include_qualified_name – If true, a more expressive qualified name will be returned along with the method name. Default is false.

Returns:

Either a list of the names of operations, or a list of tuples, each consisting of the operation name and a qualified name

5.4. Task Monitoring API

class esa_climate_toolbox.core.Monitor

A monitor is used to both observe and control a running task.

The Monitor class is an abstract base class for concrete monitors. Derived classes must implement the following three abstract methods: start(), progress(), and done(). Derived classes must implement also the following two abstract methods, if they want cancellation support: cancel() and is_cancelled().

Pass Monitor.NONE to functions that expect a monitor instead of passing None.

Given here is an example of how progress monitors should be used by functions::

def long_running_task(a, b, c, monitor):
    with monitor.starting('doing a long running task', total_work=100):
        # do 30% of the work here
        monitor.progress(work=30)
        # do 70% of the work here
        monitor.progress(work=70)

If a function makes calls to other functions that also support a monitor, a child-monitor is used::

def long_running_task(a, b, c, monitor):
    with monitor.starting('doing a long running task', total_work=100):
        # let other_task do 30% of the work
        other_task(a, b, c, monitor=monitor.child(work=30))
        # let other_task do 70% of the work
        other_task(a, b, c, monitor=monitor.child(work=70))
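A minimal concrete monitor, written against the abstract interface described here (a self-contained sketch; the toolbox ships its own implementations), shows how the start/progress/done protocol fits together:

```python
class RecordingMonitor:
    """A minimal monitor implementing the abstract interface: it records events."""

    def __init__(self):
        self.events = []

    def start(self, label, total_work=None):
        self.events.append(("start", label, total_work))

    def progress(self, work=None, msg=None):
        self.events.append(("progress", work, msg))

    def done(self):
        self.events.append(("done",))

def long_running_task(monitor):
    monitor.start('doing a long running task', total_work=100)
    try:
        monitor.progress(work=30)   # 30% done
        monitor.progress(work=70)   # remaining 70% done
    finally:
        monitor.done()

m = RecordingMonitor()
long_running_task(m)
```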
cancel()

Request the task to be cancelled. This method will usually be called from the code that created the monitor, not by users of the monitor. For example, a GUI could create the monitor due to an invocation of a long-running task, and then the user wishes to cancel that task. The default implementation does nothing. Override to implement something useful.

check_for_cancellation()

Checks if the monitor has been cancelled and raises a Cancellation in that case.

child(work: float = 1) Monitor

Return a child monitor for the given partial amount of work.

Parameters:

work – The partial amount of work.

Returns:

a sub-monitor

abstract done()

Call to signal that a task has been done.

is_cancelled() bool

Check if there is an external request to cancel the current task observed by this monitor.

Users of a monitor shall frequently call this method and check its return value. If cancellation is requested, they should politely exit the current processing in a proper way, e.g., by cleaning up allocated resources. The default implementation returns False. Subclasses shall override this method to return True if a task cancellation request was detected.

Returns:

True if task cancellation was requested externally. The default implementation returns False.

observing(label: str)

A context manager for easier use of progress monitors. Observes a dask task and reports back to the monitor.

Parameters:

label – Passed to the monitor’s start method

Returns:

abstract progress(work: float = None, msg: str = None)

Call to signal that a task has made some progress.

Parameters:
  • work – The incremental amount of work.

  • msg – A detail message.

abstract start(label: str, total_work: float = None)

Call to signal that a task has started.

Note that label and total_work are not passed to __init__, because they are usually not known at construction time. It is the responsibility of the task to derive the appropriate values for these.

Parameters:
  • label – A task label

  • total_work – The total amount of work

starting(label: str, total_work: float = None)

A context manager for easier use of progress monitors. Calls the monitor’s start method with label and total_work. Will then take care of calling Monitor.done().

Parameters:
  • label – Passed to the monitor’s start method

  • total_work – Passed to the monitor’s start method

Returns:

class esa_climate_toolbox.core.ChildMonitor(parent_monitor: Monitor, partial_work: float)

A child monitor is responsible for a partial amount of work of a parent_monitor.

Parameters:
  • parent_monitor – the parent monitor

  • partial_work – the partial amount of work of parent_monitor.

cancel()

Request the task to be cancelled. This method will usually be called from the code that created the monitor, not by users of the monitor. For example, a GUI could create the monitor due to an invocation of a long-running task, and then the user wishes to cancel that task. The default implementation does nothing. Override to implement something useful.

done()

Call to signal that a task has been done.

is_cancelled() bool

Check if there is an external request to cancel the current task observed by this monitor.

Users of a monitor shall frequently call this method and check its return value. If cancellation is requested, they should politely exit the current processing in a proper way, e.g., by cleaning up allocated resources. The default implementation returns False. Subclasses shall override this method to return True if a task cancellation request was detected.

Returns:

True if task cancellation was requested externally. The default implementation returns False.

progress(work: float = None, msg: str = None)

Call to signal that a task has made some progress.

Parameters:
  • work – The incremental amount of work.

  • msg – A detail message.

start(label: str, total_work: float = None)

Call to signal that a task has started.

Note that label and total_work are not passed to __init__, because they are usually not known at construction time. It is the responsibility of the task to derive the appropriate values for these.

Parameters:
  • label – A task label

  • total_work – The total amount of work