Subset Module

subset.py

Functions related to subsetting a NetCDF file.

podaac.subsetter.subset.align_time_to_lon_dim(time_data, lon_data, temporal_cond)[source]

Aligns a 1D time_data variable to one of the dimensions of a 2D lon_data array, renaming time_data’s dimension if it matches the size of one of lon_data’s dims.

This happens because combining a 2D x 2D x 1D bitwise mask with mismatched dimensions results in a 3D mask, which significantly increases memory usage. In this case, one of the dimensions is a “phony” dimension, so we need to align the time variable with the correct dimension to produce a proper 2D bitwise mask.

Parameters:

time_data (xr.DataArray) – 1D array of time values.
lon_data (xr.DataArray) – 2D array with two dimensions.
temporal_cond (xr.DataArray) – 1D boolean mask along the time dimension.

Returns:

temporal_cond, potentially renamed to match lon_data’s dims.

Return type:

xr.DataArray

podaac.subsetter.subset.apply_scale_offset(scale: float, offset: float, value: float) → float[source]: Apply scale and offset to the given value

podaac.subsetter.subset.calculate_chunks(dataset: Dataset) → dict[source]

For the given dataset, calculate if the size on any dimension is worth chunking. Any dimension larger than 4000 will be chunked. This is done to ensure that the variable can fit in memory.

Parameters:: dataset (xarray.Dataset) – The dataset to calculate chunks for.
Returns:: The chunk dictionary, where the key is the dimension and the value is 4000 or 500 depending on how many dimensions.
Return type:: dict

podaac.subsetter.subset.compute_coordinate_variable_names(dataset: Dataset) → Tuple[List[str] | str, List[str] | str][source]

Given a dataset, determine the coordinate variable from a list of options

Parameters:: dataset (xr.Dataset) – The dataset to find the coordinate variables for
Returns:: Tuple of strings (or list of strings), where the first element is the lat coordinate name and the second element is the lon coordinate name
Return type:: tuple, str

podaac.subsetter.subset.compute_utc_name(dataset: Dataset) → str | None[source]: Get the name of the utc variable if it is there to determine origine time

podaac.subsetter.subset.convert_bbox(bbox: ndarray, dataset: Dataset, lat_var_name: str, lon_var_name: str) → ndarray[source]

This function will return a converted bbox which matches the range of the given input file. This will convert both the latitude and longitude range. For example, an input dataset can have a valid longitude range of -180 –> 180 or of 0 –> 360.

Parameters:

bbox (np.array) – The bounding box
dataset (xarray.Dataset) – The dataset which is being subset.
lat_var_name (str) – Name of the lat variable in the given dataset
lon_var_name (str) – Name of the lon variable in the given dataset

Returns:

bbox – The new bbox which matches latitude and longitude ranges of the input file.

Return type:

np.array

Notes

Assumption that the provided bounding box is always between -180 –> 180 for longitude and -90, 90 for latitude.

podaac.subsetter.subset.convert_bound(bound: ndarray, coord_max: int, coord_var: DataArray) → ndarray[source]

This function will return a converted bound, which matches the range of the given input file.

Parameters:

bound (np.array) – 1-dimensional 2-element numpy array which represent the lower and upper bounding box on this coordinate, respectively.
coord_max (integer) – The max value which is possible given this coordinate. For example, the max for longitude is 360.
coord_var (xarray.DataArray) – The xarray variable for some coordinate.

Returns:

1-dimensional 2-element number array which represents the lower and upper bounding box on this coordinate and has been converted based on the valid bounds coordinate range of the dataset.

Return type:

np.array

Notes

Assumption that 0 is always on the prime meridian/equator.

podaac.subsetter.subset.convert_time_from_description(seconds_since, description: str)[source]

Convert time array from seconds-since format using the description.

Parameters:

seconds_since – xarray.DataArray or np.ndarray of float64 seconds
description – str containing “seconds since <date>”

Returns:

np.ndarray or xarray.DataArray with np.datetime64[ns]

podaac.subsetter.subset.create_geospatial_bounding_box(spatial_bounds_array, east, west)[source]

Generate a Well-Known Text (WKT) POLYGON string representing the geospatial bounds.

The polygon is defined using the min/max longitude and latitude values and follows the format: “POLYGON ((lon_min lat_min, lon_max lat_min, lon_max lat_max,

lon_min lat_max, lon_min lat_min))”

This ensures the polygon forms a closed loop.

Parameters:

spatial_bounds_arraylist of lists

A 2D list where: - spatial_bounds_array[0] contains [lon_min, lon_max] - spatial_bounds_array[1] contains [lat_min, lat_max]

east: float or None

longitude spacial bound east

west: float or None

longitude spacial bound west

Returns:

str: A WKT POLYGON string representing the bounding box.

Example:

>>> spatial_bounds = [[81.489693, 85.129562], [-78.832314, 49.646988]]
>>> create_geospatial_bounds(spatial_bounds)
'POLYGON ((81.489693 -78.832314, 85.129562 -78.832314, 85.129562 49.646988, 81.489693 49.646988, 81.489693 -78.832314))'

podaac.subsetter.subset.create_geospatial_bounds(dataset, lon_var_names, lat_var_names)[source]: Create geospatial bounds from 4 corners of 2d array

podaac.subsetter.subset.datetime_from_mjd(dataset: Dataset, time_var_name: str) → datetime | None[source]

Translate the modified julian date from the long name in the time attribute.

Parameters:

dataset (xr.Dataset) – Dataset that contains time var
time_var_name (str) – The name of the actual time var (with matching dims to the coord vars)

Returns:

the datetime of the modified julian date

Return type:

datetime

podaac.subsetter.subset.ensure_counter_clockwise(points)[source]: Ensures the points are ordered counterclockwise.

podaac.subsetter.subset.extract_epoch(description: str) → str[source]: Extracts the ISO 8601 epoch from a description string. Example: “seconds since 1 January 1990” → “1990-01-01T00:00:00”

podaac.subsetter.subset.get_coordinate_variable_names(dataset: Dataset, lat_var_names: list | None = None, lon_var_names: list | None = None, time_var_names: list | None = None)[source]

Retrieve coordinate variables for this dataset. If coordinate variables are provided, use those, Otherwise, attempt to determine coordinate variables manually.

Parameters:

dataset (xr.Dataset) – xarray Dataset used to compute coordinate variables manually. Only used if lat, lon, or time vars are not provided.
lat_var_names (list) – List of latitude coordinate variables.
lon_var_names (list) – List of longitude coordinate variables.
time_var_names (list) – List of time coordinate variables.

Returns:

TODO

Return type:

add return type docstring and type hint.

podaac.subsetter.subset.get_east_west_lon(dataset, lon_var_name)[source]

Determines the easternmost and westernmost longitudes from a dataset, correctly handling cases where the data crosses the antimeridian.

Parameters:

dataset – xarray.Dataset or similar The dataset containing longitude values.
lon_var_name – str The name of the longitude variable in the dataset.

Returns:

(westmost, eastmost): The westernmost and easternmost longitudes in [-180, 180] range.

Return type:

tuple

podaac.subsetter.subset.get_hdf_type(tree: DataTree) → str | None[source]

Determine the HDF type (OMI or MLS) from a DataTree object.

Parameters:: tree (DataTree) – DataTree object containing the HDF data
Returns:: ‘OMI’, ‘MLS’, or None if type cannot be determined
Return type:: Optional[str]

podaac.subsetter.subset.get_path(s)[source]: Extracts the path by removing the last part after the final ‘/’.

podaac.subsetter.subset.get_time_epoch_var(tree: DataTree, time_var_name: str) → str[source]

Get the name of the epoch time var. This is only needed in the case where there is a single time var (of size 1) that contains the time epoch used by the actual time var.

Parameters:

tree (DataTree) – DataTree that contains time var
time_var_name (str) – The name of the actual time var (with matching dims to the coord vars)

Returns:

The name of the epoch time variable

Return type:

str

podaac.subsetter.subset.get_variable_data(dtree, var_path)[source]

Retrieve data from a DataTree object given a variable path.

Parameters: - dtree: DataTree object - var_path: str, path to the variable (e.g., “group/time”)

Returns: - The data of the variable if found, else None.

podaac.subsetter.subset.is_360(lon_var: DataArray, scale: float, offset: float) → bool[source]

Determine if given dataset is a ‘360’ dataset or not.

Parameters:

lon_var (xr.DataArray) – The lon variable from the xarray Dataset
scale (float) – Used to remove scale and offset for easier calculation
offset (float) – Used to remove scale and offset for easier calculation

Returns:

True if dataset is 360, False if not. Defaults to False.

Return type:

bool

podaac.subsetter.subset.is_time_mjd(dataset: Dataset, time_var_name: str) → bool[source]

Check to see if the time format is a time delta from a modified julian date.

Parameters:

dataset (xr.Dataset) – Dataset that contains time var
time_var_name (str) – The name of the actual time var (with matching dims to the coord vars)

Returns:

is time delta format in modified julian date

Return type:

boolean

podaac.subsetter.subset.new_build_temporal_cond(min_time: str, max_time: str, dataset: Dataset, time_var_name: str) → ndarray | bool[source]

Build the temporal condition used in the xarray ‘where’ call which drops data not in the given bounds. If the data in the time var is of type ‘datetime’, assume this is a normal case where the time var uses the epoch from the ‘units’ metadata attribute to get epoch. If the data in the time var is of type ‘timedelta’, the epoch var is needed to calculate the datetime.

Parameters:

min_time (str) – ISO timestamp representing the lower temporal bound
max_time (str) – ISO timestamp representing the upper temporal bound
dataset (xr.Dataset) – Dataset to build the condition off of
time_var_name (str) – Name of the time variable

Returns:

If temporally subsetted, returns a boolean ND-array the shape of which matches the dimensions of the coordinate vars. ‘True’ is essentially a noop.

Return type:

np.array or boolean

podaac.subsetter.subset.normalize_paths(paths)[source]

Convert paths with __ notation to normal group paths, removing leading / and converting __ to /

Parameters:: paths (list of str) – List of paths with __ notation
Returns:: Normalized paths
Return type:: list of str

Examples

>>> paths = ['/__geolocation__latitude', '/__geolocation__longitude']
>>> normalize_paths(paths)
['/geolocation/latitude', '/geolocation/longitude']

podaac.subsetter.subset.open_as_nc_dataset(filepath: str) → Tuple[Dataset, bool][source]: Open netcdf file, and flatten groups if they exist.

podaac.subsetter.subset.override_decode_cf_datetime() → None[source]

WARNING !!! REMOVE AT EARLIEST XARRAY FIX, this is a override to xarray override_decode_cf_datetime function. xarray has problems decoding time units with format seconds since 2000-1-1 0:0:0 0, this solves by testing the unit to see if its parsable, if it is use original function, if not format unit into a parsable format.

https://github.com/pydata/xarray/issues/7210

podaac.subsetter.subset.remove_scale_offset(value: float, scale: float, offset: float) → float[source]: Remove scale and offset from the given value

podaac.subsetter.subset.set_json_history(dataset: Dataset, cut: bool, file_to_subset: str, bbox: ndarray | None = None, shapefile: str | None = None, origin_source=None) → None[source]

Set the ‘json_history’ metadata header of the granule to reflect the current version of the subsetter, as well as the parameters used to call the subsetter. This will append an json array to the json_history of the following format:

Parameters:

dataset (xarray.Dataset) – The dataset to change the header of
cut (boolean) – True to cut the scanline
file_to_subset (string) – The filepath of the file which was used to subset
bbox (np.ndarray) – The requested bounding box
shapefile (str) – Name of the shapefile to include in the version history
TODO (add docstring and type hint for origin_source parameter.)

podaac.subsetter.subset.set_version_history(dataset: Dataset, cut: bool, bbox: ndarray | None = None, shapefile: str | None = None) → None[source]

Set the ‘history’ metadata header of the granule to reflect the current version of the subsetter, as well as the parameters used to call the subsetter. This will append a line to the history of the following format:

TIMESTAMP podaac.subsetter VERSION (PARAMS)

Parameters:

dataset (xarray.Dataset) – The dataset to change the header of
cut (boolean) – True to cut the scanline
bbox (np.ndarray) – The requested bounding box
shapefile (str) – Name of the shapefile to include in the version history

podaac.subsetter.subset.shoelace_area(points)[source]: Computes the signed area of a polygon. Negative area → Counterclockwise Positive area → Clockwise (needs reversing)

podaac.subsetter.subset.subset(file_to_subset: str, bbox: ndarray, output_file: str, variables: List[str] | str | None = (), cut: bool = True, shapefile: str | None = None, min_time: str | None = None, max_time: str | None = None, origin_source: str | None = None, lat_var_names: List[str] = (), lon_var_names: List[str] = (), time_var_names: List[str] = (), pixel_subset: bool = False, stage_file_name_subsetted_true: str | None = None, stage_file_name_subsetted_false: str | None = None) → ndarray | None[source]

Subset a given NetCDF file given a bounding box

Parameters:

file_to_subset (string) – The location of the file which will be subset
bbox (np.ndarray) – The chosen bounding box. This is a tuple of tuples formatted as such: ((west, east), (south, north)). The assumption is that the valid range is ((-180, 180), (-90, 90)). This will be transformed as appropriate if the actual longitude range is 0-360.
output_file (string) – The file path for the output of the subsetting operation.
variables (list, str, optional) – List of variables to include in the resulting data file. NOTE: This will remove ALL variables which are not included in this list, including coordinate variables!
cut (boolean) – True if the scanline should be cut, False if the scanline should not be cut. Defaults to True.
shapefile (str) – Name of local shapefile used to subset given file.
min_time (str) – ISO timestamp representing the lower bound of the temporal subset to be performed. If this value is not provided, the granule will not be subset temporally on the lower bound.
max_time (str) – ISO timestamp representing the upper bound of the temporal subset to be performed. If this value is not provided, the granule will not be subset temporally on the upper bound.
origin_source (str) – Original location or filename of data to be used in “derived from” history element.
lat_var_names (list) – List of variables that represent the latitude coordinate variables for this granule. This list will only contain more than one value in the case where there are multiple groups and different coordinate variables for each group.
lon_var_names (list) – List of variables that represent the longitude coordinate variables for this granule. This list will only contain more than one value in the case where there are multiple groups and different coordinate variables for each group.
time_var_names (list) – List of variables that represent the time coordinate variables for this granule. This list will only contain more than one value in the case where there are multiple groups and different coordinate variables for each group.
pixel_subset (boolean) – Cut the lon lat based on the rows and columns within the bounding box, but could result with lon lats that are outside the bounding box
stage_file_name_subsetted_true (str) – stage file name if subsetting is true name depends on result of subset
stage_file_name_subsetted_false (str) – stage file name if subsetting is false name depends on result of subset
decode_times (# clean up time variable in SNDR before)
blank (# SNDR.AQUA files have ascending node time)
list(nc_dataset.variables)) (if any('__asc_node_tai93' in i for i in) –
asc_time_var = nc_dataset.variables[‘__asc_node_tai93’] if not asc_time_var[:] > 0:

del nc_dataset.variables[‘__asc_node_tai93’]

podaac.subsetter.subset.subset_with_bbox(dataset: Dataset, lat_var_names: list, lon_var_names: list, time_var_names: list, bbox: ndarray | None = None, cut: bool = True, min_time: str | None = None, max_time: str | None = None, pixel_subset: bool = False) → ndarray[source]

Subset an xarray Dataset using a spatial bounding box.

Parameters:

dataset (xr.Dataset) – Dataset to subset
lat_var_names (list) – Name of the latitude variables in the given dataset
lon_var_names (list) – Name of the longitude variables in the given dataset
time_var_names (list) – Name of the time variables in the given dataset
variables (list[str]) – List of variables to include in the result
bbox (np.array) – Spatial bounding box to subset Dataset with.
cut (bool) – True if scanline should be cut.
min_time (str) – ISO timestamp of min temporal bound
max_time (str) – ISO timestamp of max temporal bound
pixel_subset (boolean) – Cut the lon lat based on the rows and columns within the bounding box, but could result with lon lats that are outside the bounding box
TODO (add docstring and type hint for variables parameter.)

Returns:

np.array – Spatial bounds of Dataset after subset operation
TODO - fix this docstring type and the type hint to match code (currently returning a list[xr.Dataset])

podaac.subsetter.subset.subset_with_shapefile_multi(dataset: Dataset, lat_var_names: List[str], lon_var_names: List[str], shapefile: str, cut: bool, chunks, pixel_subset: bool) → Dict[str, ndarray][source]

Subset an xarray Dataset using a shapefile for multiple latitude and longitude variable pairs

Parameters:

dataset (xr.Dataset) – Dataset to subset
lat_var_names (List[str]) – List of latitude variable names in the given dataset
lon_var_names (List[str]) – List of longitude variable names in the given dataset
shapefile (str) – Absolute path to the shapefile used to subset the given dataset
cut (bool) – True if scanline should be cut
chunks (Union[Dict, None]) – Chunking specification for dask arrays

Returns:

Dictionary mapping variable names to their respective boolean masks Keys are formatted as “{lat_var_name}_{lon_var_name}”

Return type:

Dict[str, np.ndarray]

podaac.subsetter.subset.test_access_sst_dtime_values(nc_dataset)[source]

Test accessing values of ‘sst_dtime’ variable in a NetCDF file.

Parameters:: (netCDF4.Dataset) (nc_dataset)
Returns:: access_successful (bool)
Return type:: True if ‘sst_dtime’ values are accessible, False otherwise.

podaac.subsetter.subset.translate_longitude(geometry)[source]

Translates the longitude values of a Shapely geometry from the range [-180, 180) to [0, 360).

Parameters:: geometry (shapely.geometry.base.BaseGeometry) – The input shape geometry to be translated
Returns:: The translated shape geometry
Return type:: geometry

podaac.subsetter.subset.translate_timestamp(str_timestamp: str) → datetime[source]

Translate timestamp to datetime object

Parameters:: str_timestamp (str) – Timestamp string. ISO or RFC
Returns:: Constructed Datetime object
Return type:: datetime

podaac.subsetter.subset.update_netcdf_attrs(output_file: str, dataset: DataTree, lon_var_names: List[str], lat_var_names: List[str], spatial_bounds_array: list | None = None, stage_file_name_subsetted_true: str | None = None, stage_file_name_subsetted_false: str | None = None) → None[source]

Update NetCDF file attributes with spatial bounds and product name information.

Parameters:

output_file (str) – Path to the NetCDF file to be updated
dataset (xr.DataTree) – xarray data tree
lon_var_names (list) – List of possible longitude variable names
lat_var_names (list) – List of possible latitude variable names
spatial_bounds_array (list, optional) – Nested list containing spatial bounds in format: [[lon_min, lon_max], [lat_min, lat_max]]
stage_file_name_subsetted_true (str, optional) – Product name when subset is True
stage_file_name_subsetted_false (str, optional) – Product name when subset is False

Notes

Updates various geospatial attributes in the NetCDF file
Removes deprecated center coordinate attributes
Sets the product name based on provided parameters
Preserves original attribute types when setting new values

Example

>>> spatial_bounds = [[120.5, 130.5], [-10.5, 10.5]]
>>> update_netcdf_attrs("output.nc", datasets, ["lon"], spatial_bounds, "subset_true.nc", "subset_false.nc")