Reference Guide for l2ss-py

l2ss-py

Coverage


develop: .. image:: https://github.com/podaac/l2ss-py/actions/workflows/build-pipeline.yml/badge.svg?branch=develop&event=push


main: .. image:: https://github.com/podaac/l2ss-py/actions/workflows/build-pipeline.yml/badge.svg?branch=main&event=push

Harmony service for subsetting L2 data. l2ss-py supports:

  • Spatial subsetting

    • Bounding box

    • Shapefile subsetting

    • GeoJSON subsetting

  • Temporal subsetting

  • Variable subsetting

If you would like to contribute to l2ss-py, refer to the contribution document.

How to test l2ss-py locally

Unit tests

There are comprehensive unit tests for l2ss-py. The tests can be run as follows:

poetry run pytest -m "not aws and not integration" tests/

You can generate coverage reports as follows:

poetry run pytest --junitxml=build/reports/pytest.xml --cov=podaac/ --cov-report=html -m "not aws and not integration" tests/

l2ss-py script

You can run l2ss-py on a single granule without using Harmony. In order to run this, the l2ss-py package must be installed in your current Python interpreter

$ l2ss-py --help
usage: run_subsetter.py [-h] [--bbox BBOX BBOX BBOX BBOX]
                        [--variables VARIABLES [VARIABLES ...]]
                        [--min-time MIN_TIME] [--max-time MAX_TIME] [--cut]
                        input_file output_file

Run l2ss-py

positional arguments:
  input_file            File to subset
  output_file           Output file

optional arguments:
  -h, --help            show this help message and exit
  --bbox BBOX BBOX BBOX BBOX
                        Bounding box in the form min_lon min_lat max_lon
                        max_lat
  --variables VARIABLES [VARIABLES ...]
                        Variables, only include if variable subset is desired.
                        Should be a space separated list of variable names
                        e.g. sst wind_dir sst_error ...
  --min-time MIN_TIME   Min time. Should be ISO-8601 format. Only include if
                        temporal subset is desired.
  --max-time MAX_TIME   Max time. Should be ISO-8601 format. Only include if
                        temporal subset is desired.
  --cut                 If provided, scanline will be cut
  --shapefile SHAPEFILE
                        Path to either shapefile or geojson file used to subset the provided input granule

For example:

l2ss-py /path/to/input.nc /path/to/output.nc --bbox -50 -10 50 10 --variables wind_speed wind_dir ice_age time --min-time '2015-07-02T09:00:00' --max-time '2015-07-02T10:00:00' --cut

An addition to providing a bounding box, spatial subsetting can be achieved by passing in a shapefile or a geojson file.

```shell script poetry run l2ss-py /path/to/input.nc /path/to/output.nc –shapefile /path/to/test.shp

or

```shell script
poetry run l2ss-py /path/to/input.nc /path/to/output.nc --shapefile /path/to/test.geojson

Running Harmony locally

In order to fully test l2ss-py with Harmony, you can run Harmony locally. This requires the data exists in UAT Earthdata Cloud.

  1. Set up local Harmony instance. Instructions here

  2. Add concept ID for your data to services.yml

  3. Execute a local Harmony l2ss-py request. For example:

    localhost:3000/YOUR_COLLECTION_ID/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?format=application%2Fx-netcdf4&subset=lat(-10%3A10)&subset=lon(-10%3A10)&maxResults=2
    

Module Documentation

subset.py

Functions related to subsetting a NetCDF file.

podaac.subsetter.subset.apply_scale_offset(scale, offset, value)[source]

Apply scale and offset to the given value

podaac.subsetter.subset.build_temporal_cond(min_time, max_time, dataset, time_var_name)[source]

Build the temporal condition used in the xarray ‘where’ call which drops data not in the given bounds. If the data in the time var is of type ‘datetime’, assume this is a normal case where the time var uses the epoch from the ‘units’ metadata attribute to get epoch. If the data in the time var is of type ‘timedelta’, the epoch var is needed to calculate the datetime.

Parameters
  • min_time (str) – ISO timestamp representing the lower temporal bound

  • max_time (str) – ISO timestamp representing the upper temporal bound

  • dataset (xr.Dataset) – Dataset to build the condition off of

  • time_var_name (str) – Name of the time variable

Returns

If temporally subsetted, returns a boolean ND-array the shape of which matches the dimensions of the coordinate vars. ‘True’ is essentially a noop.

Return type

np.array or boolean

podaac.subsetter.subset.calculate_chunks(dataset)[source]

For the given dataset, calculate if the size on any dimension is worth chunking. Any dimension larger than 4000 will be chunked. This is done to ensure that the variable can fit in memory.

Parameters

dataset (xarray.Dataset) – The dataset to calculate chunks for.

Returns

The chunk dictionary, where the key is the dimension and the value is 4000 or 500 depending on how many dimensions.

Return type

dict

podaac.subsetter.subset.compute_coordinate_variable_names(dataset)[source]

Given a dataset, determine the coordinate variable from a list of options

Parameters

dataset (xr.Dataset) – The dataset to find the coordinate variables for

Returns

Tuple of strings, where the first element is the lat coordinate name and the second element is the lon coordinate name

Return type

tuple, str

podaac.subsetter.subset.compute_time_variable_name(dataset, lat_var)[source]

Try to determine the name of the ‘time’ variable. This is done as follows:

  • The variable name contains ‘time’

  • The variable dimensions match the dimensions of the given lat var

Parameters
  • dataset (xr.Dataset:) – xarray dataset to find time variable from

  • lat_var (xr.Variable) – Lat variable for this dataset

Returns

The name of the variable

Return type

str

Raises

ValueError – If the time variable could not be determined

podaac.subsetter.subset.convert_bbox(bbox, dataset, lat_var_name, lon_var_name)[source]

This function will return a converted bbox which matches the range of the given input file. This will convert both the latitude and longitude range. For example, an input dataset can have a valid longitude range of -180 –> 180 or of 0 –> 360.

Parameters
  • bbox (np.array) – The bounding box

  • dataset (xarray.Dataset) – The dataset which is being subset.

  • lat_var_name (str) – Name of the lat variable in the given dataset

  • lon_var_name (str) – Name of the lon variable in the given dataset

Returns

bbox – The new bbox which matches latitude and longitude ranges of the input file.

Return type

np.array

Notes

Assumption that the provided bounding box is always between -180 –> 180 for longitude and -90, 90 for latitude.

podaac.subsetter.subset.convert_bound(bound, coord_max, coord_var)[source]

This function will return a converted bound which which matches the range of the given input file.

Parameters
  • bound (np.array) – 1-dimensional 2-element numpy array which represent the lower and upper bounding box on this coordinate, respectively.

  • coord_max (integer) – The max value which is possible given this coordinate. For example, the max for longitude is 360.

  • coord_var (xarray.DataArray) – The xarray variable for some coordinate.

Returns

1-dimensional 2-element number array which represents the lower and upper bounding box on this coordinate and has been converted based on the valid bounds coordinate range of the dataset.

Return type

np.array

Notes

Assumption that 0 is always on the prime meridian/equator.

podaac.subsetter.subset.datetime_from_mjd(dataset, time_var_name)[source]

Translate the modified julian date from the long name in the time attribute.

Parameters
  • dataset (xr.Dataset) – Dataset that contains time var

  • time_var_name (str) – The name of the actual time var (with matching dims to the coord vars)

Returns

the datetime of the modified julian date

Return type

datetime

podaac.subsetter.subset.find_matching_coords(dataset, match_list)[source]

As a backup for finding a coordinate var, look at the ‘coordinates’ metadata attribute of all data vars in the granule. Return any coordinate vars that have name matches with the provided ‘match_list’

Parameters
  • dataset (xr.Dataset) – Dataset to search data variable coordinate metadata attribute

  • match_list (list (str)) – List of possible matches to search for. For example, [‘lat’, ‘latitude’] would search for variables in the ‘coordinates’ metadata attribute containing either ‘lat’ or ‘latitude’

Returns

List of matching coordinate variables names

Return type

list (str)

podaac.subsetter.subset.get_coordinate_variable_names(dataset, lat_var_names=None, lon_var_names=None, time_var_names=None)[source]

Retrieve coordinate variables for this dataset. If coordinate variables are provided, use those, Otherwise, attempt to determine coordinate variables manually.

Parameters
  • dataset (xr.Dataset) – xarray Dataset used to compute coordinate variables manually. Only used if lat, lon, or time vars are not provided.

  • lat_var_names (list) – List of latitude coordinate variables.

  • lon_var_names (list) – List of longitude coordinate variables.

  • time_var_names (list) – List of time coordinate variables.

podaac.subsetter.subset.get_spatial_bounds(dataset, lat_var_names, lon_var_names)[source]

Get the spatial bounds for this dataset. These values are masked and scaled.

Parameters
  • dataset (xr.Dataset) – Dataset to retrieve spatial bounds for

  • lat_var_name (str) – Name of the lat variable

  • lon_var_name (str) – Name of the lon variable

Returns

[[lon min, lon max], [lat min, lat max]]

Return type

np.array

podaac.subsetter.subset.get_time_epoch_var(dataset, time_var_name)[source]

Get the name of the epoch time var. This is only needed in the case where there is a single time var (of size 1) that contains the time epoch used by the actual time var.

Parameters
  • dataset (xr.Dataset) – Dataset that contains time var

  • time_var_name (str) – The name of the actual time var (with matching dims to the coord vars)

Returns

The name of the epoch time variable

Return type

str

podaac.subsetter.subset.h5file_transform(finput)[source]

Transform a h5py Dataset that has groups to an xarray compatible dataset. xarray does not work with groups, so this transformation will flatten the variables in the dataset and use the group path as the new variable name. For example, data_01 > km > sst would become ‘data_01__km__sst’, where GROUP_DELIM is __.

Returns

netCDF4 Dataset that does not contain groups and that has been flattened.

Return type

nc.Dataset

podaac.subsetter.subset.is_360(lon_var, scale, offset)[source]

Determine if given dataset is a ‘360’ dataset or not.

Parameters
  • lon_var (xr.DataArray) – The lon variable from the xarray Dataset

  • scale (float) – Used to remove scale and offset for easier calculation

  • offset (float) – Used to remove scale and offset for easier calculation

Returns

True if dataset is 360, False if not. Defaults to False.

Return type

bool

podaac.subsetter.subset.is_time_mjd(dataset, time_var_name)[source]

Check to see if the time format is a time delta from a modified julian date.

Parameters
  • dataset (xr.Dataset) – Dataset that contains time var

  • time_var_name (str) – The name of the actual time var (with matching dims to the coord vars)

Returns

is time delta format in modified julian date

Return type

boolean

podaac.subsetter.subset.recombine_grouped_datasets(datasets, output_file)[source]

Given a list of xarray datasets, combine those datasets into a single netCDF4 Dataset and write to the disk. Each dataset has been transformed using its group path and needs to be un-transformed and placed in the appropriate group.

Parameters
  • datasets (list (xr.Dataset)) – List of xarray datasets to be combined

  • output_file (str) – Name of the output file to write the resulting NetCDF file to.

podaac.subsetter.subset.remove_scale_offset(value, scale, offset)[source]

Remove scale and offset from the given value

podaac.subsetter.subset.set_json_history(dataset, cut, file_to_subset, bbox=None, shapefile=None, origin_source=None)[source]

Set the ‘json_history’ metadata header of the granule to reflect the current version of the subsetter, as well as the parameters used to call the subsetter. This will append an json array to the json_history of the following format:

Parameters
  • dataset (xarray.Dataset) – The dataset to change the header of

  • bbox (np.ndarray) – The requested bounding box

  • file_to_subset (string) – The filepath of the file which was used to subset

  • cut (boolean) – True to cut the scanline

  • shapefile (str) – Name of the shapefile to include in the version history

podaac.subsetter.subset.set_version_history(dataset, cut, bbox=None, shapefile=None)[source]

Set the ‘history’ metadata header of the granule to reflect the current version of the subsetter, as well as the parameters used to call the subsetter. This will append a line to the history of the following format:

TIMESTAMP podaac.subsetter VERSION (PARAMS)

Parameters
  • dataset (xarray.Dataset) – The dataset to change the header of

  • bbox (np.ndarray) – The requested bounding box

  • cut (boolean) – True to cut the scanline

  • shapefile (str) – Name of the shapefile to include in the version history

podaac.subsetter.subset.subset(file_to_subset, bbox, output_file, variables=None, cut=True, shapefile=None, min_time=None, max_time=None, origin_source=None, lat_var_names=None, lon_var_names=None, time_var_names=None)[source]

Subset a given NetCDF file given a bounding box

Parameters
  • file_to_subset (string) – The location of the file which will be subset

  • output_file (string) – The file path for the output of the subsetting operation.

  • bbox (np.ndarray) – The chosen bounding box. This is a tuple of tuples formatted as such: ((west, east), (south, north)). The assumption is that the valid range is ((-180, 180), (-90, 90)). This will be transformed as appropriate if the actual longitude range is 0-360.

  • shapefile (str) – Name of local shapefile used to subset given file.

  • variables (list, str, optional) – List of variables to include in the resulting data file. NOTE: This will remove ALL variables which are not included in this list, including coordinate variables!

  • cut (boolean) – True if the scanline should be cut, False if the scanline should not be cut. Defaults to True.

  • min_time (str) – ISO timestamp representing the lower bound of the temporal subset to be performed. If this value is not provided, the granule will not be subset temporally on the lower bound.

  • max_time (str) – ISO timestamp representing the upper bound of the temporal subset to be performed. If this value is not provided, the granule will not be subset temporally on the upper bound.

  • lat_var_names (list) – List of variables that represent the latitude coordinate variables for this granule. This list will only contain more than one value in the case where there are multiple groups and different coordinate variables for each group.

  • lon_var_names (list) – List of variables that represent the longitude coordinate variables for this granule. This list will only contain more than one value in the case where there are multiple groups and different coordinate variables for each group.

  • time_var_names (list) – List of variables that represent the time coordinate variables for this granule. This list will only contain more than one value in the case where there are multiple groups and different coordinate variables for each group.

podaac.subsetter.subset.subset_with_bbox(dataset, lat_var_names, lon_var_names, time_var_names, variables=None, bbox=None, cut=True, min_time=None, max_time=None)[source]

Subset an xarray Dataset using a spatial bounding box.

Parameters
  • dataset (xr.Dataset) – Dataset to subset

  • lat_var_names (list) – Name of the latitude variables in the given dataset

  • lon_var_names (list) – Name of the longitude variables in the given dataset

  • time_var_names (list) – Name of the time variables in the given dataset

  • bbox (np.array) – Spatial bounding box to subset Dataset with.

  • cut (bool) – True if scanline should be cut.

  • min_time (str) – ISO timestamp of min temporal bound

  • max_time (str) – ISO timestamp of max temporal bound

Returns

Spatial bounds of Dataset after subset operation

Return type

np.array

podaac.subsetter.subset.subset_with_shapefile(dataset, lat_var_name, lon_var_name, shapefile, cut, chunks)[source]

Subset an xarray Dataset using a shapefile

Parameters
  • dataset (xr.Dataset) – Dataset to subset

  • lat_var_name (str) – Name of the latitude variable in the given dataset

  • lon_var_name (str) – Name of the longitude variable in the given dataset

  • shapefile (str) – Absolute path to the shapefile used to subset the given dataset

  • cut (bool) – True if scanline should be cut.

Returns

Spatial bounds of Dataset after shapefile subset operation

Return type

np.array

podaac.subsetter.subset.transform_grouped_dataset(nc_dataset, file_to_subset)[source]

Transform a netCDF4 Dataset that has groups to an xarray compatible dataset. xarray does not work with groups, so this transformation will flatten the variables in the dataset and use the group path as the new variable name. For example, data_01 > km > sst would become ‘data_01__km__sst’, where GROUP_DELIM is __.

This same pattern is applied to dimensions, which are located under the appropriate group. They are renamed and placed in the root group.

Parameters

nc_dataset (nc.Dataset) – netCDF4 Dataset that contains groups

Returns

netCDF4 Dataset that does not contain groups and that has been flattened.

Return type

nc.Dataset

podaac.subsetter.subset.translate_timestamp(str_timestamp)[source]

Translate timestamp to datetime object

Parameters

str_timestamp (str) – Timestamp string. ISO or RFC

Returns

Constructed Datetime object

Return type

datetime

subset_harmony.py

Implementation of harmony-service-lib that invokes the Level 2 subsetter.

class podaac.subsetter.subset_harmony.L2SubsetterService(message, catalog=None, config=None)[source]

See https://github.com/nasa/harmony-service-lib-py for documentation and examples.

__init__(message, catalog=None, config=None)[source]

Constructs the adapter

Parameters
  • message (harmony.Message) – The Harmony input which needs acting upon

  • catalog (pystac.Catalog) – A STAC catalog containing the files on which to act

  • config (harmony.util.Config) – The configuration values for this runtime environment.

cmd(*args)[source]

Logs and then runs command.

Parameters

run (args Command and args to) –

Return type

Command output

prepare_output_dir(output_dir)[source]

Deletes (if present) and recreates the given output_dir, ensuring it exists and is empty

Parameters

output_dir (string) – the directory to delete and recreate

process_item(item, source)[source]

Performs variable and bounding box subsetting on the input STAC Item’s data, returning an output STAC item

Parameters
  • item (pystac.Item) – the item that should be subset

  • source (harmony.message.Source) – the input source defining the variables, if any, to subset from the item

Returns

a STAC item describing the output of the subsetter

Return type

pystac.Item

podaac.subsetter.subset_harmony.harmony_to_podaac_bbox(bbox)[source]

Convert Harmony bbox (west, south, east, north) to PO.DAAC bbox ((west, east), (south, north))

Parameters

bbox (array) – Harmony bbox

Returns

PO.DAAC bbox

Return type

np.array

podaac.subsetter.subset_harmony.main(config=None)[source]

Parse command line arguments and invoke the service to respond to them.

Parameters

config (harmony.util.Config) –

Return type

None

podaac.subsetter.subset_harmony.podaac_to_harmony_bbox(bbox)[source]

Convert PO.DAAC bbox ((west, east), (south, north)) to Harmony bbox (west, south, east, north)

Parameters

bbox (np.array) – Podaac bbox

Returns

Harmony bbox

Return type

array, int or float

run_subsetter.py

This script runs L2SS-Py on the given granule.

podaac.subsetter.run_subsetter.main()[source]

Entry point to the script

podaac.subsetter.run_subsetter.parse_args(args)[source]

Parse args for this script.

Returns

input_file, output_file, bbox, variables, min_time, max_time

Return type

tuple

podaac.subsetter.run_subsetter.run_subsetter(args)[source]

Parse arguments and run subsetter on the specified input file

xarray_enhancements.py

Functions which improve upon existing xarray functionality, optimized for this specific use-case.

podaac.subsetter.xarray_enhancements.cast_type(var, var_type)[source]

Type cast a variable into a var type.

Parameters
  • var (xarray.core.dataarray.DataArray) – The dataarray to be type casted.

  • var_type (string) – New type the variable will be type casted to.

Returns

The newly type casted variable.

Return type

xarray.core.dataarray.DataArray

podaac.subsetter.xarray_enhancements.copy_empty_dataset(dataset)[source]
Copy an dataset into a new, empty dataset. This dataset should:
  • Contain the same structure as the input dataset (only include requested variables, if variable subset)

  • Contain the same global metadata as the input dataset

  • Contain a history field which describes this subset operation.

Parameters

dataset (xarray.Dataset) – The dataset to copy into a empty dataset.

Returns

The new dataset which has no data.

Return type

xarray.Dataset

podaac.subsetter.xarray_enhancements.get_indexers_from_1d(cond)[source]

Get indexers from a dataset with 1 dimension.

Parameters

cond (xarray.Dataset) – Contains the result of the initial lat lon condition.

Returns

Indexer dictionary for the provided condition.

Return type

dict

podaac.subsetter.xarray_enhancements.get_indexers_from_nd(cond, cut)[source]

Get indexers from a dataset with more than 1 dimensions.

Parameters
  • cond (xarray.Dataset) – Contains the result of the initial lat lon condition.

  • cut (bool) – True if the scanline should be cut.

Returns

Indexer dictionary for the provided condition.

Return type

dict

podaac.subsetter.xarray_enhancements.where(dataset, cond, cut)[source]

Return a dataset which meets the given condition.

This is a modification of the existing xarray ‘where’ function. https://github.com/pydata/xarray/blob/master/xarray/core/common.py#L999

Parameters
  • dataset (xarray.Dataset) – The dataset to filter and return.

  • cond (DataArray or Dataset with boolean dtype) – Locations at which to preserve this object’s values.

  • cut (boolean) – True if the scanline should be cut, False if the scanline should not be cut.

Returns

The filtered Dataset

Return type

xarray.Dataset

Notes

The cond variable contains a boolean mask of valid data indices. However in that mask, True represents valid data and False represents invalid data.

Indices and tables