merger package
Subpackages
Submodules
merger.merge module
Main module containing the merge implementation
- merger.merge.clean_metadata(metadata)[source]
Prepares a metadata dictionary for insertion by removing inconsistent entries and escaping attribute names
- Parameters
metadata (dict) – dictionary of attribute names and their associated data
- merger.merge.init_dataset(merged_dataset, groups, var_info, max_dims, input_files)[source]
Initialize the dataset utilizing data gathered from preprocessing
- Parameters
merged_dataset (nc.Dataset) – the dataset to be initialized
groups (list) – list of group names
var_info (dict) – dictionary of variable names and VariableInfo objects
max_dims (dict) – dictionary of dimension names (including path) and their sizes
input_files (list) – list of file paths to be merged
- merger.merge.is_file_empty(parent_group)[source]
Function to test whether all variables in a dataset have size 0
- merger.merge.merge_netcdf_files(original_input_files, output_file, logger=<Logger merger.merge (WARNING)>, perf_stats=None, process_count=None)[source]
Main entrypoint to the merge implementation. Merges n >= 2 granules together into a single granule. Named in reference to the original Java implementation. A usage sketch follows the parameter list.
- Parameters
original_input_files (list) – list of string paths to NetCDF4 files to merge
output_file (str) – output path for merged product
logger (logger) – logger object
perf_stats (dict) – dictionary used to store performance stats
process_count (int) – number of processes to run (expected >= 1)
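A minimal usage sketch; the granule paths, output path, and process count below are illustrative placeholders:

    import logging
    from merger.merge import merge_netcdf_files

    # Hypothetical input granules and output path
    granules = ["granule_001.nc", "granule_002.nc", "granule_003.nc"]
    perf_stats = {}

    merge_netcdf_files(
        granules,                                  # original_input_files
        "merged_product.nc",                       # output_file
        logger=logging.getLogger("merger.merge"),
        perf_stats=perf_stats,                     # dictionary used to store performance stats
        process_count=4,                           # number of worker processes (>= 1)
    )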
merger.merge_cli module
A simple CLI wrapper around the main merge function
merger.merge_worker module
Merging methods and the utilities to automagically run them in single-thread/multiprocess modes
- merger.merge_worker.max_var_memory(file_list, var_info, max_dims)[source]
Function to get the maximum amount of shared memory that will be used for variables
- Parameters
file_list (list) – List of file paths to be processed
var_info (dict) – Dictionary of variable paths and associated VariableInfo
max_dims (dict) – Dictionary of dimension paths and the maximum dimension sizes found during preprocessing
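A conceptual sketch of the calculation, not the library's implementation; it assumes each variable's footprint is its dtype item size times the product of its maximum dimension sizes:

    import numpy as np

    def max_var_memory_sketch(var_dtypes, var_dims, max_dims):
        """Estimate the largest buffer any single variable could require.

        var_dtypes: {var_path: numpy dtype}
        var_dims:   {var_path: [dim_path, ...]}
        max_dims:   {dim_path: maximum size found during preprocessing}
        """
        max_bytes = 0
        for var_path, dtype in var_dtypes.items():
            element_count = int(np.prod([max_dims[d] for d in var_dims[var_path]]))
            max_bytes = max(max_bytes, element_count * np.dtype(dtype).itemsize)
        return max_bytes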
- merger.merge_worker.resize_var(var, var_info, max_dims)[source]
Resizes a variable’s data to the maximum dimensions found in preprocessing. This method never downscales a variable and only performs bottom and left padding, as in the original Java implementation. A conceptual padding sketch follows this entry.
- Parameters
var (nc.Variable) – variable to be resized
var_info (VariableInfo) – VariableInfo associated with the variable being resized
max_dims (dict) – dictionary of maximum dimensions found during preprocessing
- Returns
An ndarray containing the resized data
- Return type
np.ndarray
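A conceptual padding sketch (not the library's code) showing how an array can be grown to a target shape with a fill value, never shrinking the original data:

    import numpy as np

    def pad_to_shape(data, target_shape, fill_value):
        """Grow `data` to `target_shape`, filling the newly added region."""
        resized = np.full(target_shape, fill_value, dtype=data.dtype)
        resized[tuple(slice(0, size) for size in data.shape)] = data
        return resized

    original = np.arange(6).reshape(2, 3)
    padded = pad_to_shape(original, (4, 3), fill_value=-1)  # two new rows filled with -1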
- merger.merge_worker.run_merge(merged_dataset, file_list, var_info, max_dims, process_count, logger)[source]
Automagically run merging in an optimized mode determined by the environment
- Parameters
merged_dataset (nc.Dataset) – Destination dataset of the merge operation
file_list (list) – List of file paths to be processed
var_info (dict) – Dictionary of variable paths and associated VariableInfo
max_dims (dict) – Dictionary of dimension paths and maximum dimensions found during preprocessing
process_count (int) – Number of worker processes to run (expected >= 1)
Attempt to get the shared memory space size by reading /dev/shm on Linux machines
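The note above can be illustrated with a hedged sketch that reads the size of /dev/shm via shutil.disk_usage; the actual helper and its fallback behaviour may differ:

    import os
    import shutil

    def shared_memory_size_sketch(default_bytes=64 * 1024 * 1024):
        """Return the total size of /dev/shm on Linux, else an assumed default."""
        if os.path.isdir("/dev/shm"):
            return shutil.disk_usage("/dev/shm").total
        return default_bytes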
merger.path_utils module
Utilities used throughout the merging implementation to simplify group path resolution and generation
- merger.path_utils.get_group_path(group, resource)[source]
Generates a Unix-like path from a group and resource to be accessed
- Parameters
group (nc.Group) – NetCDF4 group that contains the resource
resource (str) – name of the resource being accessed
- Returns
Unix-like path to the resource
- Return type
str
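A hedged sketch of the path generation, assuming netCDF4's Group.path attribute (root groups report a path of '/'):

    def get_group_path_sketch(group, resource):
        """Build a Unix-like path such as '/group_a/latitude'."""
        if group.path == "/":          # avoid producing '//resource' at the root
            return "/" + resource
        return group.path + "/" + resource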
- merger.path_utils.resolve_dim(dims, group_path, dim_name)[source]
Attempt to resolve a dimension name, starting from the innermost group and traversing upward toward the root group
- Parameters
dims (dict) – Dictionary of dimensions to be traversed
group_path (str) – the group path from which to start resolving the specific dimension
dim_name (str) – the name of the dim to be resolved
- Returns
the size of the dimension requested
- Return type
int
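A conceptual sketch of that resolution order (not the library's code): look for the dimension at the given group path, then walk up parent groups until a match is found:

    def resolve_dim_sketch(dims, group_path, dim_name):
        path = group_path
        while path and path != "/":
            candidate = path + "/" + dim_name
            if candidate in dims:
                return dims[candidate]
            path = path.rsplit("/", 1)[0]   # step up one group level
        return dims["/" + dim_name]         # fall back to the root group

    dims = {"/time": 10, "/group_a/obs": 42}
    assert resolve_dim_sketch(dims, "/group_a", "obs") == 42
    assert resolve_dim_sketch(dims, "/group_a", "time") == 10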
- merger.path_utils.resolve_group(dataset, path)[source]
Resolves a group path into two components: the group and the resource’s name
- Parameters
dataset (nc.Dataset) – NetCDF4 Dataset used as the root for all groups
path (str) – the path to the resource
- Returns
a tuple containing the resolved group and the final path component (str), respectively
- Return type
tuple
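A hedged sketch of the split, walking nested groups by name (netCDF4 Dataset and Group objects both expose a groups mapping); the in-memory dataset below is a throwaway example:

    import netCDF4 as nc

    def resolve_group_sketch(dataset, path):
        """Split '/group_a/latitude' into (group object, 'latitude')."""
        group_path, _, resource = path.rpartition("/")
        group = dataset
        for name in group_path.strip("/").split("/"):
            if name:
                group = group.groups[name]
        return group, resource

    ds = nc.Dataset("inmemory.nc", "w", diskless=True)   # not written to disk
    ds.createGroup("group_a")
    group, name = resolve_group_sketch(ds, "/group_a/latitude")
    assert group.path == "/group_a" and name == "latitude"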
merger.preprocess_worker module
Preprocessing methods and the utilities to automagically run them in single-thread/multiprocess modes
- merger.preprocess_worker.attr_eq(attr_1, attr_2)[source]
Helper function to check whether one attribute value is equal to another (a simple == comparison is not sufficient for all attribute types)
- Parameters
attr_1 (obj) – An attribute value
attr_2 (obj) – An attribute value
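A hedged illustration of why a plain == is not always sufficient: NumPy array attributes compare element-wise and return an array rather than a single bool. The real implementation may handle more cases than this sketch:

    import numpy as np

    def attr_eq_sketch(attr_1, attr_2):
        if isinstance(attr_1, np.ndarray) or isinstance(attr_2, np.ndarray):
            return np.array_equal(attr_1, attr_2)
        return attr_1 == attr_2

    assert attr_eq_sketch(np.array([1, 2]), np.array([1, 2]))
    assert not attr_eq_sketch("CF-1.7", "CF-1.8")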
- merger.preprocess_worker.construct_history(input_files)[source]
Construct the history JSON entry for this concatenation operation. See https://wiki.earthdata.nasa.gov/display/TRT/In-File+Provenance+Metadata+-+TRT-42
- Parameters
input_files (list) – List of input files
- Returns
History JSON constructed for this concat operation
- Return type
dict
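An illustrative sketch of such an entry; the field names below are assumptions for demonstration, not the exact schema from the wiki page linked above:

    from datetime import datetime, timezone

    def construct_history_sketch(input_files):
        return {
            "date_time": datetime.now(timezone.utc).isoformat(),
            "derived_from": [str(path) for path in input_files],
            "program": "merger",                       # illustrative program name
            "parameters": f"input_files={input_files}",
        }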
- merger.preprocess_worker.get_max_dims(group, max_dims)[source]
Aggregates dimensions from each group and creates a dictionary of the largest dimension sizes for each group
- Parameters
group (nc.Dataset, nc.Group) – group to process dimensions from
max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes
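A conceptual sketch of the aggregation (not the library's code): record the largest size seen for each dimension path across processed groups:

    def get_max_dims_sketch(group, max_dims):
        """group: nc.Dataset or nc.Group; max_dims maps dim_path -> size."""
        group_path = getattr(group, "path", "/")        # root datasets report '/'
        for name, dim in group.dimensions.items():
            dim_path = group_path.rstrip("/") + "/" + name
            max_dims[dim_path] = max(max_dims.get(dim_path, 0), dim.size)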
- merger.preprocess_worker.get_metadata(group, metadata)[source]
Aggregates metadata from various NetCDF4 objects into a dictionary
- Parameters
group (nc.Dataset, nc.Group, nc.Variable) – the NetCDF4 object to aggregate metadata from
metadata (dict) – a dictionary containing the object name and associated metadata
- merger.preprocess_worker.get_variable_data(group, var_info, var_metadata)[source]
Aggregate variable metadata and attributes. Primarily utilized in process_groups
- Parameters
group (nc.Dataset, nc.Group) – group associated with this variable
var_info (dict) – dictionary of variable paths and associated VariableInfo
var_metadata (dict) – dictionary of variable paths and associated attribute dictionary
- merger.preprocess_worker.merge_max_dims(merged_max_dims, subset_max_dims)[source]
Perform aggregation of max_dims. Intended for use in multithreaded mode only
- Parameters
merged_max_dims (dict) – Dictionary of the aggregated max_dims
subset_max_dims (dict) – Dictionary of max_dims from one of the worker processes
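A hedged sketch of the aggregation performed here, keeping the largest size per dimension path:

    def merge_max_dims_sketch(merged_max_dims, subset_max_dims):
        for dim_path, size in subset_max_dims.items():
            merged_max_dims[dim_path] = max(merged_max_dims.get(dim_path, 0), size)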
- merger.preprocess_worker.merge_metadata(merged_metadata, subset_metadata)[source]
Perform aggregation of metadata. Intended for use in multithreaded mode only
- Parameters
merged_metadata (dict) – Dictionary of the aggregated metadata
subset_metadata (dict) – Dictionary of metadata from one of the worker processes
- merger.preprocess_worker.process_groups(parent_group, group_list, max_dims, group_metadata, var_metadata, var_info)[source]
Perform preprocessing of a group and recursively process each child group
- Parameters
parent_group (nc.Dataset, nc.Group) – current group to be processed
group_list (list) – list of group paths
max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes
group_metadata (dict) – dictionary which stores group paths and their associated attributes
var_metadata (dict) – dictionary of dictionaries which stores variable paths and their associated attributes
var_info (dict) – dictionary of variable paths and associated VariableInfo data
merger.variable_info module
Wrapper used to manage variable metadata
- class merger.variable_info.VariableInfo(var)[source]
Bases:
object
Lightweight wrapper class utilized in granule preprocessing to simplify comparisons between variables from different granule sets
- name
name of the variable
- Type
str
- dim_order
list of dimension names in order
- Type
list
- datatype
the numpy datatype for the data held in the variable
- Type
numpy.dtype
- group_path
Unix-like group path to the variable
- Type
str
- fill_value
Value used to fill missing/empty values in variable’s data
- Type
object
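A brief usage sketch: wrap a netCDF4 variable and inspect the attributes listed above. The in-memory dataset below is a throwaway example:

    import netCDF4 as nc
    from merger.variable_info import VariableInfo

    ds = nc.Dataset("inmemory.nc", "w", diskless=True)   # not written to disk
    ds.createDimension("time", 4)
    var = ds.createVariable("temperature", "f4", ("time",), fill_value=-9999.0)

    info = VariableInfo(var)
    print(info.name, info.dim_order, info.datatype, info.group_path, info.fill_value)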