merger package
Submodules
merger.merge module
Main module containing the merge implementation
- merger.merge.clean_metadata(metadata: dict) → None
Prepares a metadata dictionary for insertion by removing inconsistent entries and escaping attribute names
- Parameters:
metadata (dict) – dictionary of attribute names and their associated data
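A minimal usage sketch, assuming the dictionary maps attribute names to their gathered data as described above; the attribute names and values here are hypothetical:

```python
from merger.merge import clean_metadata

# Hypothetical pre-gathered metadata; clean_metadata is assumed to
# mutate the dict in place, since its return type is None.
metadata = {
    "units": "K",
    "valid_range": [0, 400],
}
clean_metadata(metadata)
# Inconsistent entries removed, attribute names escaped.
```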
- merger.merge.init_dataset(merged_dataset: Dataset, groups: list[str], var_info: dict, max_dims: dict, input_files: list[Path]) → None
Initialize the dataset using data gathered from preprocessing
- Parameters:
merged_dataset (nc.Dataset) – the dataset to be initialized
groups (list) – list of group names
var_info (dict) – dictionary of variable names and VariableInfo objects
max_dims (dict) – dictionary of dimension names (including path) and their sizes
input_files (list) – list of file paths to be merged
- merger.merge.is_file_empty(parent_group: Dataset | Group) → bool
Test if a NetCDF file/group has all variables of zero size (recursively checks child groups)
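A short usage sketch; the input file name is hypothetical:

```python
import netCDF4 as nc
from merger.merge import is_file_empty

with nc.Dataset("granule.nc") as dataset:  # hypothetical input file
    if is_file_empty(dataset):
        print("all variables, including those in child groups, have zero size")
```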
- merger.merge.merge_netcdf_files(original_input_files: list[Path], output_file: str, granule_urls, logger=<Logger merger.merge (WARNING)>, perf_stats: dict = None, process_count: int = None)
Main entry point to the merge implementation. Merges n >= 2 granules together into a single granule. Named in reference to the original Java implementation.
- Parameters:
original_input_files (list) – list of Paths to NetCDF4 files to merge
output_file (str) – output path for merged product
granule_urls
logger (logging.Logger) – logger object
perf_stats (dict) – dictionary used to store performance stats
process_count (int) – number of processes to run (expected >= 1)
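A hedged invocation sketch; granule_urls is not described above, so a list of URL strings is assumed here, and the file names are hypothetical:

```python
from pathlib import Path
from merger.merge import merge_netcdf_files

input_files = [Path("granule_a.nc"), Path("granule_b.nc")]  # hypothetical granules
granule_urls = ["https://example.com/granule_a.nc",         # shape assumed
                "https://example.com/granule_b.nc"]

merge_netcdf_files(input_files, "merged.nc", granule_urls, process_count=2)
```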
merger.merge_cli module
A simple CLI wrapper around the main merge function
merger.merge_worker module
Merging methods and the utilities to automagically run them in single-thread/multiprocess modes
- merger.merge_worker.max_var_memory(file_list: list[Path], var_info: dict, max_dims) → int
Gets the maximum amount of shared memory that will be used to hold variable data
- Parameters:
file_list (list) – List of file paths to be processed
var_info (dict) – Dictionary of variable paths and associated VariableInfo
max_dims
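Not the function's implementation, but a back-of-the-envelope sketch of the quantity it estimates: a single variable's footprint at its maximum dimensions, taken as the product of the dimension sizes and the dtype item size:

```python
import numpy as np

def single_var_bytes(max_shape: tuple, dtype) -> int:
    """Estimate one variable's footprint at its maximum dimensions."""
    return int(np.prod(max_shape)) * np.dtype(dtype).itemsize

single_var_bytes((4000, 500), np.float32)  # 8,000,000 bytes
```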
- merger.merge_worker.resize_var(var: Variable, var_info, max_dims: dict) → ndarray
Resizes a variable’s data to the maximum dimensions found in preprocessing. This method never downscales a variable; it only performs bottom and left padding, as in the original Java implementation
- Parameters:
var (nc.Variable) – variable to be resized
var_info – contains a group path to this variable
max_dims (dict) – dictionary of maximum dimensions found during preprocessing
- Returns:
An ndarray containing the resized data
- Return type:
np.ndarray
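A sketch of the padding idea, assuming the variable's data is copied into a fill-value array of the maximum shape; this illustration pads at the trailing edges, while resize_var's exact placement (bottom and left padding) follows the original Java implementation noted above:

```python
import numpy as np

def pad_to_max(data: np.ndarray, max_shape: tuple, fill_value) -> np.ndarray:
    """Place data into a fill-value array of the maximum dimensions."""
    resized = np.full(max_shape, fill_value, dtype=data.dtype)
    # Copy the original data into the leading corner; the rest of the
    # array keeps the fill value.
    resized[tuple(slice(0, n) for n in data.shape)] = data
    return resized

pad_to_max(np.ones((2, 3)), (4, 5), fill_value=-9999.0).shape  # (4, 5)
```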
- merger.merge_worker.run_merge(merged_dataset: Dataset, file_list: list[Path], var_info: dict, max_dims: dict, process_count: int, logger: Logger)
Automagically run merging in an optimized mode determined by the environment
- Parameters:
merged_dataset (nc.Dataset) – Destination dataset of the merge operation
file_list (list) – List of file paths to be processed
var_info (dict) – Dictionary of variable paths and associated VariableInfo
max_dims (dict) – Dictionary of dimension paths and maximum dimensions found during preprocessing
process_count (int) – Number of worker processes to run (expected >= 1)
logger (logging.Logger) – logger object
Note: the available shared memory space size is determined by reading /dev/shm on Linux machines
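A hedged sketch of calling run_merge directly, assuming the var_info and max_dims dictionaries were already produced by preprocessing (see merger.preprocess_worker below); file names are hypothetical:

```python
import logging
from pathlib import Path

import netCDF4 as nc
from merger.merge_worker import run_merge

file_list = [Path("granule_a.nc"), Path("granule_b.nc")]  # hypothetical inputs
var_info = {}   # normally produced by preprocessing
max_dims = {}   # normally produced by preprocessing
logger = logging.getLogger("merger")

with nc.Dataset("merged.nc", "w") as merged_dataset:
    run_merge(merged_dataset, file_list, var_info, max_dims,
              process_count=2, logger=logger)
```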
merger.path_utils module
Utilities used throughout the merging implementation to simplify group path resolution and generation
- merger.path_utils.collapse_dims(dims: dict) → dict
Collapse redundant child-dimension paths when a root dimension already exists.
If a dimension exists at the root level (e.g., “/mirror_step”) and a child path also defines the same dimension (e.g., “/product/mirror_step”), the child dimension is removed as it is redundant; resolution of the dimension will fall back to the root dimension via the existing resolve_dim logic.
Dimensions that appear only in child groups (i.e., have no parent/root version) are preserved.
- Parameters:
dims (dict) – Dictionary of {path: size}, where paths are HDF5/NetCDF-style dimension paths.
- Example keys:
“/mirror_step”, “/product/mirror_step”, “/support_data/swt_level”
- Returns:
A new dictionary with redundant child dimension declarations removed.
- Return type:
dict
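An illustration built from the example keys in the docstring; the sizes are hypothetical:

```python
from merger.path_utils import collapse_dims

dims = {
    "/mirror_step": 4000,
    "/product/mirror_step": 4000,   # redundant: a root version exists
    "/support_data/swt_level": 3,   # child-only, so it is preserved
}
collapse_dims(dims)
# -> {"/mirror_step": 4000, "/support_data/swt_level": 3}
```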
- merger.path_utils.get_group_path(group: Group, resource: str) → str
Generates a Unix-like path from a group and resource to be accessed
- Parameters:
group (nc.Group) – NetCDF4 group that contains the resource
resource (str) – name of the resource being accessed
- Returns:
Unix-like path to the resource
- Return type:
str
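A short sketch, assuming a file with a "product" group that contains a "latitude" variable (both hypothetical):

```python
import netCDF4 as nc
from merger.path_utils import get_group_path

with nc.Dataset("granule.nc") as dataset:      # hypothetical file
    group = dataset.groups["product"]
    get_group_path(group, "latitude")          # -> "/product/latitude"
```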
- merger.path_utils.resolve_dim(dims: dict, group_path: str, dim_name: str)
Attempt to resolve a dimension name, starting from the given group path and walking up through parent groups to the root group
- Parameters:
dims (dict) – Dictionary of dimensions to be traversed
group_path (str) – the group path from which to start resolving the specific dimension
dim_name (str) – the name of the dim to be resolved
- Returns:
the size of the dimension requested
- Return type:
int
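A sketch of the fallback behavior, using the example dimension paths from collapse_dims above:

```python
from merger.path_utils import resolve_dim

dims = {"/mirror_step": 4000, "/support_data/swt_level": 3}
resolve_dim(dims, "/product", "mirror_step")
# "/product/mirror_step" is absent, so resolution falls back to the
# root-level "/mirror_step" and returns 4000.
```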
- merger.path_utils.resolve_group(dataset: Dataset, path: str)
Resolves a group path into two components: the group and the resource’s name
- Parameters:
dataset (nc.Dataset) – NetCDF4 Dataset used as the root for all groups
path (str) – the path to the resource
- Returns:
a tuple of the resolved group and the final path component str respectively
- Return type:
tuple
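A short sketch with a hypothetical file and resource path:

```python
import netCDF4 as nc
from merger.path_utils import resolve_group

with nc.Dataset("granule.nc") as dataset:          # hypothetical file
    group, name = resolve_group(dataset, "/product/latitude")
    # group -> the "product" group object, name -> "latitude"
```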
merger.preprocess_worker module
Preprocessing methods and the utilities to automagically run them in single-thread/multiprocess modes
- merger.preprocess_worker.attr_eq(attr_1, attr_2) → bool
Helper function to check whether one attribute value is equal to another (a simple == comparison does not work for all attribute types)
- Parameters:
attr_1 (obj) – An attribute value
attr_2 (obj) – An attribute value
- merger.preprocess_worker.construct_history(input_files: list[Path], granule_urls: str) → dict
Construct the history JSON entry for this concatenation operation. See: https://wiki.earthdata.nasa.gov/display/TRT/In-File+Provenance+Metadata+-+TRT-42
- Parameters:
input_files (list) – List of input files
granule_urls (str)
- Returns:
History JSON constructed for this concat operation
- Return type:
dict
- merger.preprocess_worker.get_max_dims(group: Dataset | Group, max_dims: dict) → None
Aggregates dimensions from each group and creates a dictionary of the largest dimension sizes for each group
- Parameters:
group (nc.Dataset, nc.Group) – group to process dimensions from
max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes
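A sketch of accumulating maximum dimension sizes across several hypothetical granules, one root group at a time:

```python
import netCDF4 as nc
from merger.preprocess_worker import get_max_dims

max_dims = {}
for path in ("granule_a.nc", "granule_b.nc"):  # hypothetical files
    with nc.Dataset(path) as dataset:
        get_max_dims(dataset, max_dims)
# max_dims now maps each dimension path to the largest size seen in any file
```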
- merger.preprocess_worker.get_metadata(group: Dataset | Group | Variable, metadata: dict) → None
Aggregates metadata from various NetCDF4 objects into a dictionary
- Parameters:
group (nc.Dataset, nc.Group, nc.Variable) – the NetCDF4 object to aggregate metadata from
metadata (dict) – a dictionary containing the object name and associated metadata
- merger.preprocess_worker.get_variable_data(group: Dataset | Group, var_info: dict, var_metadata: dict) → None
Aggregate variable metadata and attributes. Primarily utilized in process_groups
- Parameters:
group (nc.Dataset, nc.Group) – group associated with this variable
var_info (dict) – dictionary of variable paths and associated VariableInfo
var_metadata (dict) – dictionary of variable paths and associated attribute dictionary
- merger.preprocess_worker.merge_max_dims(merged_max_dims, subset_max_dims)
Perform aggregation of max_dims. Intended for use in multiprocess mode only
- Parameters:
merged_max_dims (dict) – Dictionary of the aggregated max_dims
subset_max_dims (dict) – Dictionary of max_dims from one of the worker processes
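An illustration of the aggregation, assuming the merged dictionary is mutated in place and the larger size wins for each dimension path; the values are hypothetical:

```python
from merger.preprocess_worker import merge_max_dims

merged_max_dims = {"/mirror_step": 3500}
subset_max_dims = {"/mirror_step": 4000, "/support_data/swt_level": 3}
merge_max_dims(merged_max_dims, subset_max_dims)
# merged_max_dims -> {"/mirror_step": 4000, "/support_data/swt_level": 3}
```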
- merger.preprocess_worker.merge_metadata(merged_metadata: dict, subset_metadata: dict) → None
Perform aggregation of metadata. Intended for use in multiprocess mode only
- Parameters:
merged_metadata (dict) – Dictionary of the aggregated metadata
subset_metadata (dict) – Dictionary of metadata from one of the worker processes
- merger.preprocess_worker.process_groups(parent_group: Dataset | Group, group_list: list, max_dims: dict, group_metadata: dict, var_metadata: dict, var_info: dict)
Perform preprocessing of a group and recursively process each child group
- Parameters:
parent_group (nc.Dataset, nc.Group) – current group to be processed
group_list (list) – list of group paths
max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes
group_metadata (dict) – dictionary which stores group paths and their associated attributes
var_metadata (dict) – dictionary of dictionaries which stores variable paths and their associated attributes
var_info (dict) – dictionary of variable paths and associated VariableInfo data
- merger.preprocess_worker.retrieve_history(dataset)
Retrieve the history_json field from a NetCDF dataset, if it exists
- Parameters:
dataset (netCDF4.Dataset) – NetCDF Dataset representing a single granule
- Returns:
history_json field
- Return type:
dict
- merger.preprocess_worker.run_preprocess(file_list: list[Path], process_count: int, granule_urls: str) → dict
Automagically run preprocessing in an optimized mode determined by the environment
- Parameters:
file_list (list) – List of file paths to be processed
process_count (int) – Number of worker processes to run (expected >= 1)
granule_urls
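Combined with merger.merge.init_dataset and merger.merge_worker.run_merge, a hedged end-to-end sketch might look as follows. The keys assumed on the returned dict ("groups", "var_info", "max_dims") are hypothetical, since the return structure is not documented above:

```python
import logging
from pathlib import Path

import netCDF4 as nc
from merger.merge import init_dataset
from merger.merge_worker import run_merge
from merger.preprocess_worker import run_preprocess

file_list = [Path("granule_a.nc"), Path("granule_b.nc")]  # hypothetical inputs
granule_urls = "https://example.com/granules"             # str, per the annotation above
logger = logging.getLogger("merger")

preprocess = run_preprocess(file_list, process_count=2, granule_urls=granule_urls)
groups = preprocess["groups"]       # key names below are assumptions
var_info = preprocess["var_info"]
max_dims = preprocess["max_dims"]

with nc.Dataset("merged.nc", "w") as merged:
    init_dataset(merged, groups, var_info, max_dims, file_list)
    run_merge(merged, file_list, var_info, max_dims,
              process_count=2, logger=logger)
```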
merger.variable_info module
Wrapper used to manage variable metadata
- class merger.variable_info.VariableInfo(var)
Bases: object
Lightweight wrapper class utilized in granule preprocessing to simplify comparisons between variables from different granule sets
- name
name of the variable
- Type:
str
- dim_order
list of dimension names in order
- Type:
list
- datatype
the numpy datatype for the data held in the variable
- Type:
numpy.dtype
- group_path
Unix-like group path to the variable
- Type:
str
- fill_value
Value used to fill missing/empty values in the variable’s data
- Type:
object
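A sketch of the comparison use case; the variable path is hypothetical, and equality between wrappers is assumed to be the comparison the class simplifies:

```python
import netCDF4 as nc
from merger.variable_info import VariableInfo

with nc.Dataset("granule_a.nc") as a, nc.Dataset("granule_b.nc") as b:
    info_a = VariableInfo(a["/product/latitude"])  # hypothetical variable path
    info_b = VariableInfo(b["/product/latitude"])
    info_a == info_b  # assumed equality check over name, dims, dtype, etc.
```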