Reference Guide for l2ss-py¶
l2ss-py¶
develop:
.. image:: https://github.com/podaac/l2ss-py/actions/workflows/build-pipeline.yml/badge.svg?branch=develop&event=push
- target
https://github.com/podaac/l2ss-py/actions/workflows/build-pipeline.yml
- alt
Develop Build
main:
.. image:: https://github.com/podaac/l2ss-py/actions/workflows/build-pipeline.yml/badge.svg?branch=main&event=push
- target
https://github.com/podaac/l2ss-py/actions/workflows/build-pipeline.yml
- alt
Main Build
Harmony service for subsetting L2 data. l2ss-py supports:
Spatial subsetting
Bounding box
Shapefile subsetting
GeoJSON subsetting
Temporal subsetting
Variable subsetting
If you would like to contribute to l2ss-py, refer to the contribution document.
How to test l2ss-py locally¶
Unit tests¶
There are comprehensive unit tests for l2ss-py. The tests can be run as follows:
poetry run pytest -m "not aws and not integration" tests/
You can generate coverage reports as follows:
poetry run pytest --junitxml=build/reports/pytest.xml --cov=podaac/ --cov-report=html -m "not aws and not integration" tests/
l2ss-py script¶
You can run l2ss-py on a single granule without using Harmony. In order to run this, the l2ss-py package must be installed in your current Python interpreter
$ l2ss-py --help
usage: run_subsetter.py [-h] [--bbox BBOX BBOX BBOX BBOX]
[--variables VARIABLES [VARIABLES ...]]
[--min-time MIN_TIME] [--max-time MAX_TIME] [--cut]
input_file output_file
Run l2ss-py
positional arguments:
input_file File to subset
output_file Output file
optional arguments:
-h, --help show this help message and exit
--bbox BBOX BBOX BBOX BBOX
Bounding box in the form min_lon min_lat max_lon
max_lat
--variables VARIABLES [VARIABLES ...]
Variables, only include if variable subset is desired.
Should be a space separated list of variable names
e.g. sst wind_dir sst_error ...
--min-time MIN_TIME Min time. Should be ISO-8601 format. Only include if
temporal subset is desired.
--max-time MAX_TIME Max time. Should be ISO-8601 format. Only include if
temporal subset is desired.
--cut If provided, scanline will be cut
--shapefile SHAPEFILE
Path to either shapefile or geojson file used to subset the provided input granule
For example:
l2ss-py /path/to/input.nc /path/to/output.nc --bbox -50 -10 50 10 --variables wind_speed wind_dir ice_age time --min-time '2015-07-02T09:00:00' --max-time '2015-07-02T10:00:00' --cut
An addition to providing a bounding box, spatial subsetting can be achieved by passing in a shapefile or a geojson file.
```shell script poetry run l2ss-py /path/to/input.nc /path/to/output.nc –shapefile /path/to/test.shp
or
```shell script
poetry run l2ss-py /path/to/input.nc /path/to/output.nc --shapefile /path/to/test.geojson
Running Harmony locally¶
In order to fully test l2ss-py with Harmony, you can run Harmony locally. This requires the data exists in UAT Earthdata Cloud.
Set up local Harmony instance. Instructions here
Add concept ID for your data to services.yml
Execute a local Harmony l2ss-py request. For example:
localhost:3000/YOUR_COLLECTION_ID/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?format=application%2Fx-netcdf4&subset=lat(-10%3A10)&subset=lon(-10%3A10)&maxResults=2
Module Documentation¶
subset.py¶
Functions related to subsetting a NetCDF file.
- podaac.subsetter.subset.apply_scale_offset(scale, offset, value)[source]¶
Apply scale and offset to the given value
- podaac.subsetter.subset.build_temporal_cond(min_time, max_time, dataset, time_var_name)[source]¶
Build the temporal condition used in the xarray ‘where’ call which drops data not in the given bounds. If the data in the time var is of type ‘datetime’, assume this is a normal case where the time var uses the epoch from the ‘units’ metadata attribute to get epoch. If the data in the time var is of type ‘timedelta’, the epoch var is needed to calculate the datetime.
- Parameters
min_time (str) – ISO timestamp representing the lower temporal bound
max_time (str) – ISO timestamp representing the upper temporal bound
dataset (xr.Dataset) – Dataset to build the condition off of
time_var_name (str) – Name of the time variable
- Returns
If temporally subsetted, returns a boolean ND-array the shape of which matches the dimensions of the coordinate vars. ‘True’ is essentially a noop.
- Return type
np.array or boolean
- podaac.subsetter.subset.calculate_chunks(dataset)[source]¶
For the given dataset, calculate if the size on any dimension is worth chunking. Any dimension larger than 4000 will be chunked. This is done to ensure that the variable can fit in memory.
- Parameters
dataset (xarray.Dataset) – The dataset to calculate chunks for.
- Returns
The chunk dictionary, where the key is the dimension and the value is 4000 or 500 depending on how many dimensions.
- Return type
dict
- podaac.subsetter.subset.compute_coordinate_variable_names(dataset)[source]¶
Given a dataset, determine the coordinate variable from a list of options
- Parameters
dataset (xr.Dataset) – The dataset to find the coordinate variables for
- Returns
Tuple of strings, where the first element is the lat coordinate name and the second element is the lon coordinate name
- Return type
tuple, str
- podaac.subsetter.subset.compute_time_variable_name(dataset, lat_var)[source]¶
Try to determine the name of the ‘time’ variable. This is done as follows:
The variable name contains ‘time’
The variable dimensions match the dimensions of the given lat var
- Parameters
dataset (xr.Dataset:) – xarray dataset to find time variable from
lat_var (xr.Variable) – Lat variable for this dataset
- Returns
The name of the variable
- Return type
str
- Raises
ValueError – If the time variable could not be determined
- podaac.subsetter.subset.convert_bbox(bbox, dataset, lat_var_name, lon_var_name)[source]¶
This function will return a converted bbox which matches the range of the given input file. This will convert both the latitude and longitude range. For example, an input dataset can have a valid longitude range of -180 –> 180 or of 0 –> 360.
- Parameters
bbox (np.array) – The bounding box
dataset (xarray.Dataset) – The dataset which is being subset.
lat_var_name (str) – Name of the lat variable in the given dataset
lon_var_name (str) – Name of the lon variable in the given dataset
- Returns
bbox – The new bbox which matches latitude and longitude ranges of the input file.
- Return type
np.array
Notes
Assumption that the provided bounding box is always between -180 –> 180 for longitude and -90, 90 for latitude.
- podaac.subsetter.subset.convert_bound(bound, coord_max, coord_var)[source]¶
This function will return a converted bound which which matches the range of the given input file.
- Parameters
bound (np.array) – 1-dimensional 2-element numpy array which represent the lower and upper bounding box on this coordinate, respectively.
coord_max (integer) – The max value which is possible given this coordinate. For example, the max for longitude is 360.
coord_var (xarray.DataArray) – The xarray variable for some coordinate.
- Returns
1-dimensional 2-element number array which represents the lower and upper bounding box on this coordinate and has been converted based on the valid bounds coordinate range of the dataset.
- Return type
np.array
Notes
Assumption that 0 is always on the prime meridian/equator.
- podaac.subsetter.subset.datetime_from_mjd(dataset, time_var_name)[source]¶
Translate the modified julian date from the long name in the time attribute.
- Parameters
dataset (xr.Dataset) – Dataset that contains time var
time_var_name (str) – The name of the actual time var (with matching dims to the coord vars)
- Returns
the datetime of the modified julian date
- Return type
datetime
- podaac.subsetter.subset.find_matching_coords(dataset, match_list)[source]¶
As a backup for finding a coordinate var, look at the ‘coordinates’ metadata attribute of all data vars in the granule. Return any coordinate vars that have name matches with the provided ‘match_list’
- Parameters
dataset (xr.Dataset) – Dataset to search data variable coordinate metadata attribute
match_list (list (str)) – List of possible matches to search for. For example, [‘lat’, ‘latitude’] would search for variables in the ‘coordinates’ metadata attribute containing either ‘lat’ or ‘latitude’
- Returns
List of matching coordinate variables names
- Return type
list (str)
- podaac.subsetter.subset.get_coordinate_variable_names(dataset, lat_var_names=None, lon_var_names=None, time_var_names=None)[source]¶
Retrieve coordinate variables for this dataset. If coordinate variables are provided, use those, Otherwise, attempt to determine coordinate variables manually.
- Parameters
dataset (xr.Dataset) – xarray Dataset used to compute coordinate variables manually. Only used if lat, lon, or time vars are not provided.
lat_var_names (list) – List of latitude coordinate variables.
lon_var_names (list) – List of longitude coordinate variables.
time_var_names (list) – List of time coordinate variables.
- podaac.subsetter.subset.get_spatial_bounds(dataset, lat_var_names, lon_var_names)[source]¶
Get the spatial bounds for this dataset. These values are masked and scaled.
- Parameters
dataset (xr.Dataset) – Dataset to retrieve spatial bounds for
lat_var_name (str) – Name of the lat variable
lon_var_name (str) – Name of the lon variable
- Returns
[[lon min, lon max], [lat min, lat max]]
- Return type
np.array
- podaac.subsetter.subset.get_time_epoch_var(dataset, time_var_name)[source]¶
Get the name of the epoch time var. This is only needed in the case where there is a single time var (of size 1) that contains the time epoch used by the actual time var.
- Parameters
dataset (xr.Dataset) – Dataset that contains time var
time_var_name (str) – The name of the actual time var (with matching dims to the coord vars)
- Returns
The name of the epoch time variable
- Return type
str
- podaac.subsetter.subset.h5file_transform(finput)[source]¶
Transform a h5py Dataset that has groups to an xarray compatible dataset. xarray does not work with groups, so this transformation will flatten the variables in the dataset and use the group path as the new variable name. For example, data_01 > km > sst would become ‘data_01__km__sst’, where GROUP_DELIM is __.
- Returns
netCDF4 Dataset that does not contain groups and that has been flattened.
- Return type
nc.Dataset
- podaac.subsetter.subset.is_360(lon_var, scale, offset)[source]¶
Determine if given dataset is a ‘360’ dataset or not.
- Parameters
lon_var (xr.DataArray) – The lon variable from the xarray Dataset
scale (float) – Used to remove scale and offset for easier calculation
offset (float) – Used to remove scale and offset for easier calculation
- Returns
True if dataset is 360, False if not. Defaults to False.
- Return type
bool
- podaac.subsetter.subset.is_time_mjd(dataset, time_var_name)[source]¶
Check to see if the time format is a time delta from a modified julian date.
- Parameters
dataset (xr.Dataset) – Dataset that contains time var
time_var_name (str) – The name of the actual time var (with matching dims to the coord vars)
- Returns
is time delta format in modified julian date
- Return type
boolean
- podaac.subsetter.subset.recombine_grouped_datasets(datasets, output_file)[source]¶
Given a list of xarray datasets, combine those datasets into a single netCDF4 Dataset and write to the disk. Each dataset has been transformed using its group path and needs to be un-transformed and placed in the appropriate group.
- Parameters
datasets (list (xr.Dataset)) – List of xarray datasets to be combined
output_file (str) – Name of the output file to write the resulting NetCDF file to.
- podaac.subsetter.subset.remove_scale_offset(value, scale, offset)[source]¶
Remove scale and offset from the given value
- podaac.subsetter.subset.set_json_history(dataset, cut, file_to_subset, bbox=None, shapefile=None, origin_source=None)[source]¶
Set the ‘json_history’ metadata header of the granule to reflect the current version of the subsetter, as well as the parameters used to call the subsetter. This will append an json array to the json_history of the following format:
- Parameters
dataset (xarray.Dataset) – The dataset to change the header of
bbox (np.ndarray) – The requested bounding box
file_to_subset (string) – The filepath of the file which was used to subset
cut (boolean) – True to cut the scanline
shapefile (str) – Name of the shapefile to include in the version history
- podaac.subsetter.subset.set_version_history(dataset, cut, bbox=None, shapefile=None)[source]¶
Set the ‘history’ metadata header of the granule to reflect the current version of the subsetter, as well as the parameters used to call the subsetter. This will append a line to the history of the following format:
TIMESTAMP podaac.subsetter VERSION (PARAMS)
- Parameters
dataset (xarray.Dataset) – The dataset to change the header of
bbox (np.ndarray) – The requested bounding box
cut (boolean) – True to cut the scanline
shapefile (str) – Name of the shapefile to include in the version history
- podaac.subsetter.subset.subset(file_to_subset, bbox, output_file, variables=None, cut=True, shapefile=None, min_time=None, max_time=None, origin_source=None, lat_var_names=None, lon_var_names=None, time_var_names=None)[source]¶
Subset a given NetCDF file given a bounding box
- Parameters
file_to_subset (string) – The location of the file which will be subset
output_file (string) – The file path for the output of the subsetting operation.
bbox (np.ndarray) – The chosen bounding box. This is a tuple of tuples formatted as such: ((west, east), (south, north)). The assumption is that the valid range is ((-180, 180), (-90, 90)). This will be transformed as appropriate if the actual longitude range is 0-360.
shapefile (str) – Name of local shapefile used to subset given file.
variables (list, str, optional) – List of variables to include in the resulting data file. NOTE: This will remove ALL variables which are not included in this list, including coordinate variables!
cut (boolean) – True if the scanline should be cut, False if the scanline should not be cut. Defaults to True.
min_time (str) – ISO timestamp representing the lower bound of the temporal subset to be performed. If this value is not provided, the granule will not be subset temporally on the lower bound.
max_time (str) – ISO timestamp representing the upper bound of the temporal subset to be performed. If this value is not provided, the granule will not be subset temporally on the upper bound.
lat_var_names (list) – List of variables that represent the latitude coordinate variables for this granule. This list will only contain more than one value in the case where there are multiple groups and different coordinate variables for each group.
lon_var_names (list) – List of variables that represent the longitude coordinate variables for this granule. This list will only contain more than one value in the case where there are multiple groups and different coordinate variables for each group.
time_var_names (list) – List of variables that represent the time coordinate variables for this granule. This list will only contain more than one value in the case where there are multiple groups and different coordinate variables for each group.
- podaac.subsetter.subset.subset_with_bbox(dataset, lat_var_names, lon_var_names, time_var_names, variables=None, bbox=None, cut=True, min_time=None, max_time=None)[source]¶
Subset an xarray Dataset using a spatial bounding box.
- Parameters
dataset (xr.Dataset) – Dataset to subset
lat_var_names (list) – Name of the latitude variables in the given dataset
lon_var_names (list) – Name of the longitude variables in the given dataset
time_var_names (list) – Name of the time variables in the given dataset
bbox (np.array) – Spatial bounding box to subset Dataset with.
cut (bool) – True if scanline should be cut.
min_time (str) – ISO timestamp of min temporal bound
max_time (str) – ISO timestamp of max temporal bound
- Returns
Spatial bounds of Dataset after subset operation
- Return type
np.array
- podaac.subsetter.subset.subset_with_shapefile(dataset, lat_var_name, lon_var_name, shapefile, cut, chunks)[source]¶
Subset an xarray Dataset using a shapefile
- Parameters
dataset (xr.Dataset) – Dataset to subset
lat_var_name (str) – Name of the latitude variable in the given dataset
lon_var_name (str) – Name of the longitude variable in the given dataset
shapefile (str) – Absolute path to the shapefile used to subset the given dataset
cut (bool) – True if scanline should be cut.
- Returns
Spatial bounds of Dataset after shapefile subset operation
- Return type
np.array
- podaac.subsetter.subset.transform_grouped_dataset(nc_dataset, file_to_subset)[source]¶
Transform a netCDF4 Dataset that has groups to an xarray compatible dataset. xarray does not work with groups, so this transformation will flatten the variables in the dataset and use the group path as the new variable name. For example, data_01 > km > sst would become ‘data_01__km__sst’, where GROUP_DELIM is __.
This same pattern is applied to dimensions, which are located under the appropriate group. They are renamed and placed in the root group.
- Parameters
nc_dataset (nc.Dataset) – netCDF4 Dataset that contains groups
- Returns
netCDF4 Dataset that does not contain groups and that has been flattened.
- Return type
nc.Dataset
subset_harmony.py¶
Implementation of harmony-service-lib that invokes the Level 2 subsetter.
- class podaac.subsetter.subset_harmony.L2SubsetterService(message, catalog=None, config=None)[source]¶
See https://github.com/nasa/harmony-service-lib-py for documentation and examples.
- __init__(message, catalog=None, config=None)[source]¶
Constructs the adapter
- Parameters
message (harmony.Message) – The Harmony input which needs acting upon
catalog (pystac.Catalog) – A STAC catalog containing the files on which to act
config (harmony.util.Config) – The configuration values for this runtime environment.
- cmd(*args)[source]¶
Logs and then runs command.
- Parameters
run (args Command and args to) –
- Return type
Command output
- prepare_output_dir(output_dir)[source]¶
Deletes (if present) and recreates the given output_dir, ensuring it exists and is empty
- Parameters
output_dir (string) – the directory to delete and recreate
- process_item(item, source)[source]¶
Performs variable and bounding box subsetting on the input STAC Item’s data, returning an output STAC item
- Parameters
item (pystac.Item) – the item that should be subset
source (harmony.message.Source) – the input source defining the variables, if any, to subset from the item
- Returns
a STAC item describing the output of the subsetter
- Return type
pystac.Item
- podaac.subsetter.subset_harmony.harmony_to_podaac_bbox(bbox)[source]¶
Convert Harmony bbox (west, south, east, north) to PO.DAAC bbox ((west, east), (south, north))
- Parameters
bbox (array) – Harmony bbox
- Returns
PO.DAAC bbox
- Return type
np.array
run_subsetter.py
This script runs L2SS-Py on the given granule.
- podaac.subsetter.run_subsetter.parse_args(args)[source]¶
Parse args for this script.
- Returns
input_file, output_file, bbox, variables, min_time, max_time
- Return type
tuple
- podaac.subsetter.run_subsetter.run_subsetter(args)[source]¶
Parse arguments and run subsetter on the specified input file
xarray_enhancements.py¶
Functions which improve upon existing xarray functionality, optimized for this specific use-case.
- podaac.subsetter.xarray_enhancements.cast_type(var, var_type)[source]¶
Type cast a variable into a var type.
- Parameters
var (xarray.core.dataarray.DataArray) – The dataarray to be type casted.
var_type (string) – New type the variable will be type casted to.
- Returns
The newly type casted variable.
- Return type
xarray.core.dataarray.DataArray
- podaac.subsetter.xarray_enhancements.copy_empty_dataset(dataset)[source]¶
- Copy an dataset into a new, empty dataset. This dataset should:
Contain the same structure as the input dataset (only include requested variables, if variable subset)
Contain the same global metadata as the input dataset
Contain a history field which describes this subset operation.
- Parameters
dataset (xarray.Dataset) – The dataset to copy into a empty dataset.
- Returns
The new dataset which has no data.
- Return type
xarray.Dataset
- podaac.subsetter.xarray_enhancements.get_indexers_from_1d(cond)[source]¶
Get indexers from a dataset with 1 dimension.
- Parameters
cond (xarray.Dataset) – Contains the result of the initial lat lon condition.
- Returns
Indexer dictionary for the provided condition.
- Return type
dict
- podaac.subsetter.xarray_enhancements.get_indexers_from_nd(cond, cut)[source]¶
Get indexers from a dataset with more than 1 dimensions.
- Parameters
cond (xarray.Dataset) – Contains the result of the initial lat lon condition.
cut (bool) – True if the scanline should be cut.
- Returns
Indexer dictionary for the provided condition.
- Return type
dict
- podaac.subsetter.xarray_enhancements.where(dataset, cond, cut)[source]¶
Return a dataset which meets the given condition.
This is a modification of the existing xarray ‘where’ function. https://github.com/pydata/xarray/blob/master/xarray/core/common.py#L999
- Parameters
dataset (xarray.Dataset) – The dataset to filter and return.
cond (DataArray or Dataset with boolean dtype) – Locations at which to preserve this object’s values.
cut (boolean) – True if the scanline should be cut, False if the scanline should not be cut.
- Returns
The filtered Dataset
- Return type
xarray.Dataset
Notes
The cond variable contains a boolean mask of valid data indices. However in that mask, True represents valid data and False represents invalid data.