merger package

Subpackages

Submodules

merger.merge module

Main module containing the merge implementation

merger.merge.clean_metadata(metadata: dict) None[source]

Prepares a metadata dictionary for insertion by removing inconsistent entries and escaping attribute names

Parameters:

metadata (dict) – dictionary of attribute names and their associated data

merger.merge.init_dataset(merged_dataset: Dataset, groups: list[str], var_info: dict, max_dims: dict, input_files: list[Path]) None[source]

Initialize the dataset using data gathered from preprocessing

Parameters:
  • merged_dataset (nc.Dataset) – the dataset to be initialized

  • groups (list) – list of group names

  • var_info (dict) – dictionary of variable names and VariableInfo objects

  • max_dims (dict) – dictionary of dimension names (including path) and their sizes

  • input_files (list) – list of file paths to be merged

merger.merge.is_file_empty(parent_group: Dataset | Group) bool[source]

Test if a NetCDF file/group has all variables of zero size (recursively checks child groups)
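The recursive check can be sketched without netCDF4 by letting a nested dictionary stand in for the Dataset/Group tree; the `"variables"`/`"groups"` keys below are stand-ins for illustration, not the real netCDF4 API.

```python
def is_file_empty(group: dict) -> bool:
    # Sketch: `group` mocks an nc.Dataset/nc.Group as
    # {"variables": {name: size}, "groups": {name: child_group}}.
    # A group is non-empty if any of its own variables holds data
    if any(size != 0 for size in group["variables"].values()):
        return False
    # Otherwise the group is empty only if every child group is empty too
    return all(is_file_empty(child) for child in group["groups"].values())

empty = {"variables": {"lat": 0}, "groups": {}}
nonempty = {"variables": {}, "groups": {"sub": {"variables": {"t": 10}, "groups": {}}}}
print(is_file_empty(empty))     # True
print(is_file_empty(nonempty))  # False
```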

merger.merge.merge_netcdf_files(original_input_files: list[~pathlib.Path], output_file: str, granule_urls, logger=<Logger merger.merge (WARNING)>, perf_stats: dict = None, process_count: int = None)[source]

Main entrypoint to the merge implementation. Merges n >= 2 granules together as a single granule. Named in reference to the original Java implementation.

Parameters:
  • original_input_files (list) – list of Paths to NetCDF4 files to merge

  • output_file (str) – output path for merged product

  • granule_urls

  • logger (logger) – logger object

  • perf_stats (dict) – dictionary used to store performance stats

  • process_count (int) – number of processes to run (expected >= 1)

merger.merge_cli module

A simple CLI wrapper around the main merge function

merger.merge_cli.main()[source]

Main CLI entrypoint

merger.merge_worker module

Merging methods and the utilities to automagically run them in single-thread/multiprocess modes

merger.merge_worker.max_var_memory(file_list: list[Path], var_info: dict, max_dims) int[source]

Function to get the maximum shared memory that will be used for variables

Parameters:
  • file_list (list) – List of file paths to be processed

  • var_info (dict) – Dictionary of variable paths and associated VariableInfo

  • max_dims
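The estimate can be sketched as: for each file, sum the byte size of every variable once padded out to the maximum dimensions, and keep the largest per-file total. The `(dim_paths, itemsize)` tuples below are a simplified stand-in for the real VariableInfo objects, not the package's actual structures.

```python
from math import prod

def max_var_memory(file_list: list, var_info: dict, max_dims: dict) -> int:
    # Sketch: var_info maps variable paths to (dim_paths, itemsize);
    # max_dims maps dimension paths to the largest size seen in preprocessing.
    max_memory = 0
    for _file in file_list:
        # Bytes needed for every variable resized to the maximum dimensions
        file_bytes = sum(
            prod(max_dims[d] for d in dims) * itemsize
            for dims, itemsize in var_info.values()
        )
        max_memory = max(max_memory, file_bytes)
    return max_memory

max_dims = {"/x": 4, "/y": 3}
var_info = {"/temp": (["/x", "/y"], 8)}   # one hypothetical float64 variable
print(max_var_memory(["a.nc", "b.nc"], var_info, max_dims))  # 96
```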

merger.merge_worker.resize_var(var: Variable, var_info, max_dims: dict) ndarray[source]

Resizes a variable’s data to the maximum dimensions found in preprocessing. This method never downscales a variable and only performs bottom and left padding, as in the original Java implementation

Parameters:
  • var (nc.Variable) – variable to be resized

  • var_info – contains a group path to this variable

  • max_dims (dict) – dictionary of maximum dimensions found during preprocessing

Returns:

An ndarray containing the resized data

Return type:

np.ndarray
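The padding behavior can be sketched for the 2-D case with plain lists standing in for an ndarray; the real method operates on netCDF4 variables of any rank, and which sides receive padding follows the Java implementation, so this sketch is illustrative only.

```python
def resize_var(data: list, fill_value, max_shape: tuple) -> list:
    # Sketch: pad a list-of-lists out to max_shape with fill_value,
    # never shrinking the existing data.
    rows, cols = len(data), len(data[0])
    max_rows, max_cols = max_shape
    # Pad each existing row out to the maximum column count
    padded = [row + [fill_value] * (max_cols - cols) for row in data]
    # Then append whole fill rows up to the maximum row count
    padded += [[fill_value] * max_cols for _ in range(max_rows - rows)]
    return padded

out = resize_var([[1, 2], [3, 4]], -999, (3, 3))
print(out)  # [[1, 2, -999], [3, 4, -999], [-999, -999, -999]]
```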

merger.merge_worker.run_merge(merged_dataset: Dataset, file_list: list[Path], var_info: dict, max_dims: dict, process_count: int, logger: Logger)[source]

Automagically run merging in an optimized mode determined by the environment

Parameters:
  • merged_dataset (nc.Dataset) – Destination dataset of the merge operation

  • file_list (list) – List of file paths to be processed

  • var_info (dict) – Dictionary of variable paths and associated VariableInfo

  • max_dims (dict) – Dictionary of dimension paths and maximum dimensions found during preprocessing

  • process_count (int) – Number of worker processes to run (expected >= 1)

  • logger

merger.merge_worker.shared_memory_size() int[source]

Attempt to get the shared memory space size by reading /dev/shm on Linux machines
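On Linux, the size of the tmpfs mounted at /dev/shm can be read with a filesystem stat call; the fallback default below is an assumption for illustration, not the package's behavior.

```python
import os

def shared_memory_size(default: int = 2 ** 30) -> int:
    # Sketch: total /dev/shm size = fragment size * block count.
    # The 1 GiB `default` is a hypothetical fallback for platforms
    # without /dev/shm.
    try:
        stats = os.statvfs("/dev/shm")
        return stats.f_frsize * stats.f_blocks
    except OSError:
        return default

print(shared_memory_size() > 0)  # True
```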

merger.path_utils module

Utilities used throughout the merging implementation to simplify group path resolution and generation

merger.path_utils.collapse_dims(dims: dict) dict[source]

Collapse redundant child-dimension paths when a root dimension already exists.

If a dimension exists at the root level (e.g., “/mirror_step”) and a child path also defines the same dimension (e.g., “/product/mirror_step”), the child dimension is removed as it is redundant; resolution of the dimension will fall back to the root dimension via the existing resolve_dim logic.

Dimensions that appear only in child groups (i.e., have no parent/root version) are preserved.

Parameters:

dims (dict) –

Dictionary of {path: size}, where paths are HDF5/NetCDF-style dimension paths.

Example keys:

  • “/mirror_step”

  • “/product/mirror_step”

  • “/support_data/swt_level”

Returns:

A new dictionary with redundant child dimension declarations removed.

Return type:

dict
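The collapse rule described above can be sketched in a few lines: a child path is dropped exactly when a root-level entry with the same dimension name exists.

```python
import posixpath

def collapse_dims(dims: dict) -> dict:
    # Keep root-level dims, and child dims whose name has no root-level twin
    return {
        path: size
        for path, size in dims.items()
        if posixpath.dirname(path) == "/"
        or "/" + posixpath.basename(path) not in dims
    }

dims = {"/mirror_step": 100, "/product/mirror_step": 100, "/support_data/swt_level": 5}
print(collapse_dims(dims))
# {'/mirror_step': 100, '/support_data/swt_level': 5}
```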

merger.path_utils.get_group_path(group: Group, resource: str) str[source]

Generates a Unix-like path from a group and resource to be accessed

Parameters:
  • group (nc.Group) – NetCDF4 group that contains the resource

  • resource (str) – name of the resource being accessed

Returns:

Unix-like path to the resource

Return type:

str
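The path generation amounts to a Unix-style join of the group's path and the resource name; a plain path string stands in here for the nc.Group (whose path is "/" for the root dataset), so this is a sketch rather than the function's actual signature.

```python
import posixpath

def get_group_path(group_path: str, resource: str) -> str:
    # Sketch: join the group's own path with the resource name
    return posixpath.join(group_path, resource)

print(get_group_path("/", "mirror_step"))      # /mirror_step
print(get_group_path("/product", "latitude"))  # /product/latitude
```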

merger.path_utils.resolve_dim(dims: dict, group_path: str, dim_name: str)[source]

Attempt to resolve a dimension name, starting from the given group and walking up toward the root group

Parameters:
  • dims (dict) – Dictionary of dimensions to be traversed

  • group_path (str) – the group path from which to start resolving the specific dimension

  • dim_name (str) – the name of the dim to be resolved

Returns:

the size of the dimension requested

Return type:

int
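The lookup order can be sketched as: try the dimension at the group's own path first, then at each ancestor, ending at the root. The KeyError on exhaustion is an assumption for illustration.

```python
def resolve_dim(dims: dict, group_path: str, dim_name: str) -> int:
    # Sketch: try "/a/b/dim", then "/a/dim", then "/dim"
    parts = group_path.strip("/").split("/") if group_path.strip("/") else []
    while True:
        candidate = "/" + "/".join(parts + [dim_name])
        if candidate in dims:
            return dims[candidate]
        if not parts:
            raise KeyError(dim_name)  # not found anywhere up the tree
        parts.pop()                   # move one group up toward the root

dims = {"/mirror_step": 100, "/support_data/swt_level": 5}
print(resolve_dim(dims, "/product", "mirror_step"))  # 100 (falls back to the root)
```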

merger.path_utils.resolve_group(dataset: Dataset, path: str)[source]

Resolves a group path into two components: the group and the resource’s name

Parameters:
  • dataset (nc.Dataset) – NetCDF4 Dataset used as the root for all groups

  • path (str) – the path to the resource

Returns:

a tuple of the resolved group and the final path component str respectively

Return type:

tuple
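The resolution splits the path into its leading group components and its final name, descending one group per component; a nested dictionary stands in for the nc.Dataset tree in this sketch.

```python
def resolve_group(root: dict, path: str):
    # Sketch: root mocks an nc.Dataset as {"groups": {name: child, ...}}
    *group_parts, resource = path.strip("/").split("/")
    group = root
    for name in group_parts:
        group = group["groups"][name]  # descend one group per path component
    return group, resource

tree = {"groups": {"product": {"groups": {}}}}
group, name = resolve_group(tree, "/product/latitude")
print(name)                                # latitude
print(group is tree["groups"]["product"])  # True
```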

merger.preprocess_worker module

Preprocessing methods and the utilities to automagically run them in single-thread/multiprocess modes

merger.preprocess_worker.attr_eq(attr_1, attr_2) bool[source]

Helper function to check if one attribute value is equal to another; a simple == comparison does not work for all attribute types (e.g., array-valued attributes)

Parameters:
  • attr_1 (obj) – An attribute value

  • attr_2 (obj) – An attribute value
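The need for a helper arises because == on array-like attribute values compares elementwise rather than yielding one boolean. The real function likely deals with numpy arrays; this stdlib-only sketch uses lists and tuples to show the shape of the logic.

```python
def attr_eq(attr_1, attr_2) -> bool:
    # Sketch: compare sequences elementwise, scalars directly
    seq = (list, tuple)
    if isinstance(attr_1, seq) and isinstance(attr_2, seq):
        return len(attr_1) == len(attr_2) and all(
            attr_eq(a, b) for a, b in zip(attr_1, attr_2)
        )
    if isinstance(attr_1, seq) or isinstance(attr_2, seq):
        return False  # one is a sequence, the other a scalar
    return attr_1 == attr_2

print(attr_eq([1.0, 2.0], [1.0, 2.0]))  # True
print(attr_eq([1.0, 2.0], 1.0))         # False
```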

merger.preprocess_worker.construct_history(input_files: list[Path], granule_urls: str) dict[source]

Construct the history JSON entry for this concatenation operation (see https://wiki.earthdata.nasa.gov/display/TRT/In-File+Provenance+Metadata+-+TRT-42)

Parameters:
  • input_files (list) – List of input files

  • granule_urls (str)

Returns:

History JSON constructed for this concat operation

Return type:

dict
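A history entry of this kind typically records when the operation ran and which granules fed into it. The exact schema is defined by the TRT-42 wiki page linked above, so every field name in this sketch is an assumption for illustration.

```python
import json
from datetime import datetime, timezone

def construct_history(input_files: list, granule_urls: str) -> dict:
    # Sketch: all keys below are hypothetical; the real schema is in TRT-42.
    return {
        "date_time": datetime.now(timezone.utc).isoformat(),
        "derived_from": granule_urls.split(),  # assumes space-separated URLs
        "program": "concise",
        "version": "0.0.0",                    # hypothetical version string
    }

entry = construct_history(
    ["a.nc", "b.nc"],
    "https://example.com/a.nc https://example.com/b.nc",
)
print(json.dumps(entry, indent=2))
```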

merger.preprocess_worker.get_max_dims(group: Dataset | Group, max_dims: dict) None[source]

Aggregates dimensions from each group and creates a dictionary of the largest size seen for each dimension path

Parameters:
  • group (nc.Dataset, nc.Group) – group to process dimensions from

  • max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes

merger.preprocess_worker.get_metadata(group: Dataset | Group | Variable, metadata: dict) None[source]

Aggregates metadata from various NetCDF4 objects into a dictionary

Parameters:
  • group (nc.Dataset, nc.Group, nc.Variable) – the NetCDF4 object to aggregate metadata from

  • metadata (dict) – a dictionary containing the object name and associated metadata

merger.preprocess_worker.get_variable_data(group: Dataset | Group, var_info: dict, var_metadata: dict) None[source]

Aggregate variable metadata and attributes. Primarily utilized in process_groups

Parameters:
  • group (nc.Dataset, nc.Group) – group associated with this variable

  • var_info (dict) – dictionary of variable paths and associated VariableInfo

  • var_metadata (dict) – dictionary of variable paths and associated attribute dictionary

merger.preprocess_worker.merge_max_dims(merged_max_dims, subset_max_dims)[source]

Perform aggregation of max_dims. Intended for use in multiprocess mode only

Parameters:
  • merged_max_dims (dict) – Dictionary of the aggregated max_dims

  • subset_max_dims (dict) – Dictionary of max_dims from one of the worker processes
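The aggregation folds one worker's results into the shared accumulator in place, keeping the larger size for every dimension path:

```python
def merge_max_dims(merged_max_dims: dict, subset_max_dims: dict) -> None:
    # Keep the larger size for every dimension path, adding new paths as seen
    for dim_path, size in subset_max_dims.items():
        if size > merged_max_dims.get(dim_path, 0):
            merged_max_dims[dim_path] = size

merged = {"/mirror_step": 80, "/scan": 10}
merge_max_dims(merged, {"/mirror_step": 100, "/band": 7})
print(merged)  # {'/mirror_step': 100, '/scan': 10, '/band': 7}
```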

merger.preprocess_worker.merge_metadata(merged_metadata: dict, subset_metadata: dict) None[source]

Perform aggregation of metadata. Intended for use in multiprocess mode only

Parameters:
  • merged_metadata (dict) – Dictionary of the aggregated metadata

  • subset_metadata (dict) – Dictionary of metadata from one of the worker processes

merger.preprocess_worker.process_groups(parent_group: Dataset | Group, group_list: list, max_dims: dict, group_metadata: dict, var_metadata: dict, var_info: dict)[source]

Perform preprocessing of a group and recursively process each child group

Parameters:
  • parent_group (nc.Dataset, nc.Group) – current group to be processed

  • group_list (list) – list of group paths

  • max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes

  • group_metadata (dict) – dictionary which stores group paths and their associated attributes

  • var_metadata (dict) – dictionary of dictionaries which stores variable paths and their associated attributes

  • var_info (dict) – dictionary of variable paths and associated VariableInfo data

merger.preprocess_worker.retrieve_history(dataset)[source]

Retrieve history_json field from NetCDF dataset, if it exists

Parameters:

dataset (netCDF4.Dataset) – NetCDF Dataset representing a single granule

Returns:

history_json field

Return type:

dict
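Since netCDF global attributes are strings, history_json is stored as serialized JSON and must be parsed on retrieval. In this sketch a plain attribute dictionary stands in for the netCDF4.Dataset, and the empty-list default is an assumption.

```python
import json

def retrieve_history(dataset_attrs: dict):
    # Sketch: the history_json attribute holds a JSON string;
    # returning [] when absent is a hypothetical default.
    raw = dataset_attrs.get("history_json", "[]")
    return json.loads(raw)

attrs = {"history_json": '[{"program": "concise"}]'}
print(retrieve_history(attrs))  # [{'program': 'concise'}]
print(retrieve_history({}))     # []
```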

merger.preprocess_worker.run_preprocess(file_list: list[Path], process_count: int, granule_urls: str) dict[source]

Automagically run preprocessing in an optimized mode determined by the environment

Parameters:
  • file_list (list) – List of file paths to be processed

  • process_count (int) – Number of worker processes to run (expected >= 1)

  • granule_urls

merger.variable_info module

Wrapper used to manage variable metadata

class merger.variable_info.VariableInfo(var)[source]

Bases: object

Lightweight wrapper class utilized in granule preprocessing to simplify comparisons between different variables from different granule sets

name

name of the variable

Type:

str

dim_order

list of dimension names in order

Type:

list

datatype

the numpy datatype for the data held in the variable

Type:

numpy.dtype

group_path

Unix-like group path to the variable

Type:

str

fill_value

Value used to fill missing/empty values in variable’s data

Type:

object
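The attributes above suggest a small value object compared field by field. A dataclass sketch captures that shape; the real class is constructed from an nc.Variable and its equality rules may differ from dataclass defaults, and the string datatype below stands in for a numpy.dtype.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VariableInfo:
    # Sketch of the wrapper's shape; field-wise equality makes
    # cross-granule variable comparison straightforward.
    name: str
    dim_order: tuple      # dimension names in order
    datatype: str         # stands in for numpy.dtype here
    group_path: str       # Unix-like group path to the variable
    fill_value: object = None

a = VariableInfo("latitude", ("y", "x"), "float64", "/product", -9999.0)
b = VariableInfo("latitude", ("y", "x"), "float64", "/product", -9999.0)
print(a == b)  # True
```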

Module contents