merger package

Subpackages

Submodules

merger.merge module

Main module containing the merge implementation

merger.merge.clean_metadata(metadata: dict) None[source]

Prepares a metadata dictionary for insertion by removing inconsistent entries and escaping attribute names

Parameters:

metadata (dict) – dictionary of attribute names and their associated data

merger.merge.init_dataset(merged_dataset: Dataset, groups: list[str], var_info: dict, max_dims: dict, input_files: list[Path]) None[source]

Initialize the dataset using data gathered from preprocessing

Parameters:
  • merged_dataset (nc.Dataset) – the dataset to be initialized

  • groups (list) – list of group names

  • var_info (dict) – dictionary of variable names and VariableInfo objects

  • max_dims (dict) – dictionary of dimension names (including path) and their sizes

  • input_files (list) – list of file paths to be merged

merger.merge.is_file_empty(parent_group: Dataset | Group) bool[source]

Test if a NetCDF file/group has all variables of zero size (recursively checks child groups)
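The recursive check can be sketched without netCDF4 by letting a nested dictionary stand in for the Dataset/Group tree; the `"variables"`/`"groups"` keys below are stand-ins for illustration, not the real netCDF4 API.

```python
def is_file_empty(group: dict) -> bool:
    # Sketch: `group` mocks an nc.Dataset/nc.Group as
    # {"variables": {name: size}, "groups": {name: child_group}}.
    # A group is non-empty if any of its own variables holds data
    if any(size != 0 for size in group["variables"].values()):
        return False
    # Otherwise the group is empty only if every child group is empty too
    return all(is_file_empty(child) for child in group["groups"].values())

empty = {"variables": {"lat": 0}, "groups": {}}
nonempty = {"variables": {}, "groups": {"sub": {"variables": {"t": 10}, "groups": {}}}}
print(is_file_empty(empty))     # True
print(is_file_empty(nonempty))  # False
```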

merger.merge.merge_netcdf_files(original_input_files: list[~pathlib.Path], output_file: str, granule_urls, logger=<Logger merger.merge (WARNING)>, perf_stats: dict = None, process_count: int = None)[source]

Main entrypoint to the merge implementation. Merges n >= 2 granules together as a single granule. Named in reference to the original Java implementation.

Parameters:
  • original_input_files (list) – list of Paths to NetCDF4 files to merge

  • output_file (str) – output path for merged product

  • granule_urls

  • logger (logger) – logger object

  • perf_stats (dict) – dictionary used to store performance stats

  • process_count (int) – number of processes to run (expected >= 1)

merger.merge_cli module

A simple CLI wrapper around the main merge function

merger.merge_cli.main()[source]

Main CLI entrypoint

merger.merge_worker module

Merging methods and the utilities to automagically run them in single-thread/multiprocess modes

merger.merge_worker.max_var_memory(file_list: list[Path], var_info: dict, max_dims) int[source]

Function to get the maximum shared memory that will be used for variables

Parameters:
  • file_list (list) – List of file paths to be processed

  • var_info (dict) – Dictionary of variable paths and associated VariableInfo

  • max_dims
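The estimate can be sketched as: for each file, sum the byte size of every variable once padded out to the maximum dimensions, and keep the largest per-file total. The `(dim_paths, itemsize)` tuples below are a simplified stand-in for the real VariableInfo objects, not the package's actual structures.

```python
from math import prod

def max_var_memory(file_list: list, var_info: dict, max_dims: dict) -> int:
    # Sketch: var_info maps variable paths to (dim_paths, itemsize);
    # max_dims maps dimension paths to the largest size seen in preprocessing.
    max_memory = 0
    for _file in file_list:
        # Bytes needed for every variable resized to the maximum dimensions
        file_bytes = sum(
            prod(max_dims[d] for d in dims) * itemsize
            for dims, itemsize in var_info.values()
        )
        max_memory = max(max_memory, file_bytes)
    return max_memory

max_dims = {"/x": 4, "/y": 3}
var_info = {"/temp": (["/x", "/y"], 8)}   # one hypothetical float64 variable
print(max_var_memory(["a.nc", "b.nc"], var_info, max_dims))  # 96
```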

merger.merge_worker.resize_var(var: Variable, var_info, max_dims: dict) ndarray[source]

Resizes a variable’s data to the maximum dimensions found in preprocessing. This method never downscales a variable and only performs bottom and left padding, as in the original Java implementation

Parameters:
  • var (nc.Variable) – variable to be resized

  • var_info – contains a group path to this variable

  • max_dims (dict) – dictionary of maximum dimensions found during preprocessing

Returns:

An ndarray containing the resized data

Return type:

np.ndarray
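The padding behavior can be sketched for the 2-D case with plain lists standing in for an ndarray; the real method operates on netCDF4 variables of any rank, and which sides receive padding follows the Java implementation, so this sketch is illustrative only.

```python
def resize_var(data: list, fill_value, max_shape: tuple) -> list:
    # Sketch: pad a list-of-lists out to max_shape with fill_value,
    # never shrinking the existing data.
    rows, cols = len(data), len(data[0])
    max_rows, max_cols = max_shape
    # Pad each existing row out to the maximum column count
    padded = [row + [fill_value] * (max_cols - cols) for row in data]
    # Then append whole fill rows up to the maximum row count
    padded += [[fill_value] * max_cols for _ in range(max_rows - rows)]
    return padded

out = resize_var([[1, 2], [3, 4]], -999, (3, 3))
print(out)  # [[1, 2, -999], [3, 4, -999], [-999, -999, -999]]
```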

merger.merge_worker.run_merge(merged_dataset: Dataset, file_list: list[Path], var_info: dict, max_dims: dict, process_count: int, logger: Logger)[source]

Automagically run merging in an optimized mode determined by the environment

Parameters:
  • merged_dataset (nc.Dataset) – Destination dataset of the merge operation

  • file_list (list) – List of file paths to be processed

  • var_info (dict) – Dictionary of variable paths and associated VariableInfo

  • max_dims (dict) – Dictionary of dimension paths and maximum dimensions found during preprocessing

  • process_count (int) – Number of worker processes to run (expected >= 1)

  • logger

merger.merge_worker.shared_memory_size() int[source]

Attempt to get the shared memory space size by reading /dev/shm on Linux machines
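On Linux, the size of the tmpfs mounted at /dev/shm can be read with a filesystem stat call; the fallback default below is an assumption for illustration, not the package's behavior.

```python
import os

def shared_memory_size(default: int = 2 ** 30) -> int:
    # Sketch: total /dev/shm size = fragment size * block count.
    # The 1 GiB `default` is a hypothetical fallback for platforms
    # without /dev/shm.
    try:
        stats = os.statvfs("/dev/shm")
        return stats.f_frsize * stats.f_blocks
    except OSError:
        return default

print(shared_memory_size() > 0)  # True
```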

merger.path_utils module

Utilities used throughout the merging implementation to simplify group path resolution and generation

merger.path_utils.collapse_dims(dims: dict) dict[source]

Collapse redundant child-dimension paths when a root dimension already exists.

If a dimension exists at the root level (e.g., “/mirror_step”) and a child path also defines the same dimension (e.g., “/product/mirror_step”), the child dimension is removed as it is redundant; resolution of the dimension will fall back to the root dimension via the existing resolve_dim logic.

Dimensions that appear only in child groups (i.e., have no parent/root version) are preserved.

Parameters:

dims (dict) –

Dictionary of {path: size}, where paths are HDF5/NetCDF-style dimension paths.

Example keys:

  • “/mirror_step”

  • “/product/mirror_step”

  • “/support_data/swt_level”

Returns:

A new dictionary with redundant child dimension declarations removed.

Return type:

dict
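The collapse rule described above can be sketched in a few lines: a child path is dropped exactly when a root-level entry with the same dimension name exists.

```python
import posixpath

def collapse_dims(dims: dict) -> dict:
    # Keep root-level dims, and child dims whose name has no root-level twin
    return {
        path: size
        for path, size in dims.items()
        if posixpath.dirname(path) == "/"
        or "/" + posixpath.basename(path) not in dims
    }

dims = {"/mirror_step": 100, "/product/mirror_step": 100, "/support_data/swt_level": 5}
print(collapse_dims(dims))
# {'/mirror_step': 100, '/support_data/swt_level': 5}
```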

merger.path_utils.get_group_path(group: Group, resource: str) str[source]

Generates a Unix-like path from a group and resource to be accessed

Parameters:
  • group (nc.Group) – NetCDF4 group that contains the resource

  • resource (str) – name of the resource being accessed

Returns:

Unix-like path to the resource

Return type:

str
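The path generation amounts to a Unix-style join of the group's path and the resource name; a plain path string stands in here for the nc.Group (whose path is "/" for the root dataset), so this is a sketch rather than the function's actual signature.

```python
import posixpath

def get_group_path(group_path: str, resource: str) -> str:
    # Sketch: join the group's own path with the resource name
    return posixpath.join(group_path, resource)

print(get_group_path("/", "mirror_step"))      # /mirror_step
print(get_group_path("/product", "latitude"))  # /product/latitude
```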

merger.path_utils.resolve_dim(dims: dict, group_path: str, dim_name: str)[source]

Attempt to resolve a dimension name, starting from the given group and walking up toward the root group

Parameters:
  • dims (dict) – Dictionary of dimensions to be traversed

  • group_path (str) – the group path from which to start resolving the specific dimension

  • dim_name (str) – the name of the dim to be resolved

Returns:

the size of the dimension requested

Return type:

int
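The lookup order can be sketched as: try the dimension at the group's own path first, then at each ancestor, ending at the root. The KeyError on exhaustion is an assumption for illustration.

```python
def resolve_dim(dims: dict, group_path: str, dim_name: str) -> int:
    # Sketch: try "/a/b/dim", then "/a/dim", then "/dim"
    parts = group_path.strip("/").split("/") if group_path.strip("/") else []
    while True:
        candidate = "/" + "/".join(parts + [dim_name])
        if candidate in dims:
            return dims[candidate]
        if not parts:
            raise KeyError(dim_name)  # not found anywhere up the tree
        parts.pop()                   # move one group up toward the root

dims = {"/mirror_step": 100, "/support_data/swt_level": 5}
print(resolve_dim(dims, "/product", "mirror_step"))  # 100 (falls back to the root)
```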

merger.path_utils.resolve_group(dataset: Dataset, path: str)[source]

Resolves a group path into two components: the group and the resource’s name

Parameters:
  • dataset (nc.Dataset) – NetCDF4 Dataset used as the root for all groups

  • path (str) – the path to the resource

Returns:

a tuple of the resolved group and the final path component str respectively

Return type:

tuple
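The resolution splits the path into its leading group components and its final name, descending one group per component; a nested dictionary stands in for the nc.Dataset tree in this sketch.

```python
def resolve_group(root: dict, path: str):
    # Sketch: root mocks an nc.Dataset as {"groups": {name: child, ...}}
    *group_parts, resource = path.strip("/").split("/")
    group = root
    for name in group_parts:
        group = group["groups"][name]  # descend one group per path component
    return group, resource

tree = {"groups": {"product": {"groups": {}}}}
group, name = resolve_group(tree, "/product/latitude")
print(name)                                # latitude
print(group is tree["groups"]["product"])  # True
```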

merger.preprocess_worker module

Preprocessing methods and the utilities to automagically run them in single-thread/multiprocess modes

merger.preprocess_worker.attr_eq(attr_1, attr_2) bool[source]

Helper function to check if one attribute value is equal to another; a simple == comparison does not work for all attribute types (e.g., array-valued attributes)

Parameters:
  • attr_1 (obj) – An attribute value

  • attr_2 (obj) – An attribute value
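The need for a helper arises because == on array-like attribute values compares elementwise rather than yielding one boolean. The real function likely deals with numpy arrays; this stdlib-only sketch uses lists and tuples to show the shape of the logic.

```python
def attr_eq(attr_1, attr_2) -> bool:
    # Sketch: compare sequences elementwise, scalars directly
    seq = (list, tuple)
    if isinstance(attr_1, seq) and isinstance(attr_2, seq):
        return len(attr_1) == len(attr_2) and all(
            attr_eq(a, b) for a, b in zip(attr_1, attr_2)
        )
    if isinstance(attr_1, seq) or isinstance(attr_2, seq):
        return False  # one is a sequence, the other a scalar
    return attr_1 == attr_2

print(attr_eq([1.0, 2.0], [1.0, 2.0]))  # True
print(attr_eq([1.0, 2.0], 1.0))         # False
```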

merger.preprocess_worker.construct_history(input_files: list[Path], granule_urls: str) dict[source]

Construct the history JSON entry for this concatenation operation (see https://wiki.earthdata.nasa.gov/display/TRT/In-File+Provenance+Metadata+-+TRT-42)

Parameters:
  • input_files (list) – List of input files

  • granule_urls (str)

Returns:

History JSON constructed for this concat operation

Return type:

dict
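A history entry of this kind typically records when the operation ran and which granules fed into it. The exact schema is defined by the TRT-42 wiki page linked above, so every field name in this sketch is an assumption for illustration.

```python
import json
from datetime import datetime, timezone

def construct_history(input_files: list, granule_urls: str) -> dict:
    # Sketch: all keys below are hypothetical; the real schema is in TRT-42.
    return {
        "date_time": datetime.now(timezone.utc).isoformat(),
        "derived_from": granule_urls.split(),  # assumes space-separated URLs
        "program": "concise",
        "version": "0.0.0",                    # hypothetical version string
    }

entry = construct_history(
    ["a.nc", "b.nc"],
    "https://example.com/a.nc https://example.com/b.nc",
)
print(json.dumps(entry, indent=2))
```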

merger.preprocess_worker.get_max_dims(group: Dataset | Group, max_dims: dict) None[source]

Aggregates dimensions from each group and creates a dictionary of the largest size seen for each dimension path

Parameters:
  • group (nc.Dataset, nc.Group) – group to process dimensions from

  • max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes

merger.preprocess_worker.get_metadata(group: Dataset | Group | Variable, metadata: dict) None[source]

Aggregates metadata from various NetCDF4 objects into a dictionary

Parameters:
  • group (nc.Dataset, nc.Group, nc.Variable) – the NetCDF4 object to aggregate metadata from

  • metadata (dict) – a dictionary containing the object name and associated metadata

merger.preprocess_worker.get_variable_data(group: Dataset | Group, var_info: dict, var_metadata: dict) None[source]

Aggregate variable metadata and attributes. Primarily utilized in process_groups

Parameters:
  • group (nc.Dataset, nc.Group) – group associated with this variable

  • var_info (dict) – dictionary of variable paths and associated VariableInfo

  • var_metadata (dict) – dictionary of variable paths and associated attribute dictionary

merger.preprocess_worker.merge_max_dims(merged_max_dims, subset_max_dims)[source]

Perform aggregation of max_dims. Intended for use in multiprocess mode only

Parameters:
  • merged_max_dims (dict) – Dictionary of the aggregated max_dims

  • subset_max_dims (dict) – Dictionary of max_dims from one of the worker processes
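The aggregation folds one worker's results into the shared accumulator in place, keeping the larger size for every dimension path:

```python
def merge_max_dims(merged_max_dims: dict, subset_max_dims: dict) -> None:
    # Keep the larger size for every dimension path, adding new paths as seen
    for dim_path, size in subset_max_dims.items():
        if size > merged_max_dims.get(dim_path, 0):
            merged_max_dims[dim_path] = size

merged = {"/mirror_step": 80, "/scan": 10}
merge_max_dims(merged, {"/mirror_step": 100, "/band": 7})
print(merged)  # {'/mirror_step': 100, '/scan': 10, '/band': 7}
```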

merger.preprocess_worker.merge_metadata(merged_metadata: dict, subset_metadata: dict) None[source]

Perform aggregation of metadata. Intended for use in multiprocess mode only

Parameters:
  • merged_metadata (dict) – Dictionary of the aggregated metadata

  • subset_metadata (dict) – Dictionary of metadata from one of the worker processes

merger.preprocess_worker.process_groups(parent_group: Dataset | Group, group_list: list, max_dims: dict, group_metadata: dict, var_metadata: dict, var_info: dict)[source]

Perform preprocessing of a group and recursively process each child group

Parameters:
  • parent_group (nc.Dataset, nc.Group) – current group to be processed

  • group_list (list) – list of group paths

  • max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes

  • group_metadata (dict) – dictionary which stores group paths and their associated attributes

  • var_metadata (dict) – dictionary of dictionaries which stores variable paths and their associated attributes

  • var_info (dict) – dictionary of variable paths and associated VariableInfo data

merger.preprocess_worker.retrieve_history(dataset)[source]

Retrieve history_json field from NetCDF dataset, if it exists

Parameters:

dataset (netCDF4.Dataset) – NetCDF Dataset representing a single granule

Returns:

history_json field

Return type:

dict
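Since netCDF global attributes are strings, history_json is stored as serialized JSON and must be parsed on retrieval. In this sketch a plain attribute dictionary stands in for the netCDF4.Dataset, and the empty-list default is an assumption.

```python
import json

def retrieve_history(dataset_attrs: dict):
    # Sketch: the history_json attribute holds a JSON string;
    # returning [] when absent is a hypothetical default.
    raw = dataset_attrs.get("history_json", "[]")
    return json.loads(raw)

attrs = {"history_json": '[{"program": "concise"}]'}
print(retrieve_history(attrs))  # [{'program': 'concise'}]
print(retrieve_history({}))     # []
```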

merger.preprocess_worker.run_preprocess(file_list: list[Path], process_count: int, granule_urls: str) dict[source]

Automagically run preprocessing in an optimized mode determined by the environment

Parameters:
  • file_list (list) – List of file paths to be processed

  • process_count (int) – Number of worker processes to run (expected >= 1)

  • granule_urls

merger.variable_info module

Wrapper used to manage variable metadata

class merger.variable_info.VariableInfo(var)[source]

Bases: object

Lightweight wrapper class utilized in granule preprocessing to simplify comparisons between different variables from different granule sets

name

name of the variable

Type:

str

dim_order

list of dimension names in order

Type:

list

datatype

the numpy datatype for the data held in the variable

Type:

numpy.dtype

group_path

Unix-like group path to the variable

Type:

str

fill_value

Value used to fill missing/empty values in variable’s data

Type:

object
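The attributes above suggest a small value object compared field by field. A dataclass sketch captures that shape; the real class is constructed from an nc.Variable and its equality rules may differ from dataclass defaults, and the string datatype below stands in for a numpy.dtype.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VariableInfo:
    # Sketch of the wrapper's shape; field-wise equality makes
    # cross-granule variable comparison straightforward.
    name: str
    dim_order: tuple      # dimension names in order
    datatype: str         # stands in for numpy.dtype here
    group_path: str       # Unix-like group path to the variable
    fill_value: object = None

a = VariableInfo("latitude", ("y", "x"), "float64", "/product", -9999.0)
b = VariableInfo("latitude", ("y", "x"), "float64", "/product", -9999.0)
print(a == b)  # True
```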

Module contents