merger package

Subpackages

Submodules

merger.merge module

Main module containing the merge implementation

merger.merge.clean_metadata(metadata)[source]

Prepares a metadata dictionary for insertion by removing inconsistent entries and escaping attribute names

Parameters

metadata (dict) – dictionary of attribute names and their associated data
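The cleanup step can be sketched as follows. This is an illustrative guess at the behavior, not the tool's exact rules: it assumes inconsistent entries were previously flagged with a `False` placeholder, and that escaping replaces `/` (netCDF4's group separator) with `_`.

```python
def clean_metadata_sketch(metadata):
    """Drop attributes flagged as inconsistent and escape '/' in names.

    The False placeholder and the underscore escape are assumptions
    for illustration only.
    """
    cleaned = {}
    for name, value in metadata.items():
        if value is False:  # assumed marker for an inconsistent attribute
            continue
        # '/' is netCDF4's group separator, so it cannot appear in names
        cleaned[name.replace('/', '_')] = value
    return cleaned
```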

merger.merge.init_dataset(merged_dataset, groups, var_info, max_dims, input_files)[source]

Initialize the dataset utilizing data gathered from preprocessing

Parameters
  • merged_dataset (nc.Dataset) – the dataset to be initialized

  • groups (list) – list of group names

  • var_info (dict) – dictionary of variable names and VariableInfo objects

  • max_dims (dict) – dictionary of dimension names (including path) and their sizes

  • input_files (list) – list of file paths to be merged

merger.merge.is_file_empty(parent_group)[source]

Function to test whether every variable in a dataset has size 0
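A minimal sketch of such an emptiness check, assuming the netCDF4 group interface (`.variables`, `.groups` mappings and a per-variable `.size`):

```python
def is_file_empty_sketch(parent_group):
    # Empty only if every variable in this group AND every child group has size 0.
    for var in parent_group.variables.values():
        if var.size != 0:
            return False
    return all(is_file_empty_sketch(child) for child in parent_group.groups.values())
```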

merger.merge.merge_netcdf_files(original_input_files, output_file, logger=<Logger merger.merge (WARNING)>, perf_stats=None, process_count=None)[source]

Main entrypoint to the merge implementation. Merges n >= 2 granules into a single granule. Named in reference to the original Java implementation.

Parameters
  • original_input_files (list) – list of string paths to NetCDF4 files to merge

  • output_file (str) – output path for merged product

  • logger (logger) – logger object

  • perf_stats (dict) – dictionary used to store performance stats

  • process_count (int) – number of processes to run (expected >= 1)

merger.merge_cli module

A simple CLI wrapper around the main merge function

merger.merge_cli.main()[source]

Main CLI entrypoint

merger.merge_worker module

Merging methods and the utilities to automagically run them in single-thread/multiprocess modes

merger.merge_worker.max_var_memory(file_list, var_info, max_dims)[source]

Function to get the maximum shared memory that will be used for variables

Parameters
  • file_list (list) – List of file paths to be processed

  • var_info (dict) – Dictionary of variable paths and associated VariableInfo

  • max_dims (dict) – Dictionary of dimension paths and maximum dimension sizes found during preprocessing

merger.merge_worker.resize_var(var, var_info, max_dims)[source]

Resizes a variable’s data to the maximum dimensions found in preprocessing. This method will never downscale a variable and only performs bottom and left padding as utilized in the original Java implementation

Parameters
  • var (nc.Variable) – variable to be resized

  • var_info (VariableInfo) – VariableInfo associated with the variable being resized

  • max_dims (dict) – dictionary of maximum dimensions found during preprocessing

Returns

An ndarray containing the resized data

Return type

np.ndarray
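The padding behavior can be sketched with NumPy: allocate a fill-valued array at the target shape, then copy the original data into the origin corner. The function name and the exact padding sides are illustrative; the real implementation may place padding differently.

```python
import numpy as np

def resize_var_sketch(data, fill_value, max_shape):
    # Never downscale: max_shape is assumed >= data.shape in every axis.
    resized = np.full(max_shape, fill_value, dtype=data.dtype)
    # Copy the original data into the origin corner; the remainder stays
    # at fill_value (the padded region).
    resized[tuple(slice(0, n) for n in data.shape)] = data
    return resized
```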

merger.merge_worker.run_merge(merged_dataset, file_list, var_info, max_dims, process_count, logger)[source]

Automagically run merging in an optimized mode determined by the environment

Parameters
  • merged_dataset (nc.Dataset) – Destination dataset of the merge operation

  • file_list (list) – List of file paths to be processed

  • var_info (dict) – Dictionary of variable paths and associated VariableInfo

  • max_dims (dict) – Dictionary of dimension paths and maximum dimensions found during preprocessing

  • process_count (int) – Number of worker processes to run (expected >= 1)

merger.merge_worker.shared_memory_size()[source]

Try to get the shared memory space size by reading /dev/shm on Linux machines

merger.path_utils module

Utilities used throughout the merging implementation to simplify group path resolution and generation

merger.path_utils.get_group_path(group, resource)[source]

Generates a Unix-like path from a group and resource to be accessed

Parameters
  • group (nc.Group) – NetCDF4 group that contains the resource

  • resource (str) – name of the resource being accessed

Returns

Unix-like path to the resource

Return type

str
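The path generation reduces to simple string joining, since netCDF4 groups expose a `.path` attribute (`'/'` for the root Dataset). A minimal sketch under that assumption:

```python
def get_group_path_sketch(group, resource):
    # The root group's path is '/', so avoid producing a '//' prefix.
    if group.path == '/':
        return '/' + resource
    return group.path + '/' + resource
```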

merger.path_utils.resolve_dim(dims, group_path, dim_name)[source]

Attempt to resolve a dimension name, starting from the innermost group and walking up toward the root group

Parameters
  • dims (dict) – Dictionary of dimensions to be traversed

  • group_path (str) – the group path from which to start resolving the specific dimension

  • dim_name (str) – the name of the dim to be resolved

Returns

the size of the dimension requested

Return type

int
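The lookup can be sketched as a walk from the most specific dimension path toward the root, assuming `dims` keys are full paths such as `/group/dim`:

```python
def resolve_dim_sketch(dims, group_path, dim_name):
    # Try '/a/b/dim', then '/a/dim', then '/dim' until a match is found.
    parts = group_path.rstrip('/').split('/')
    while parts:
        key = '/'.join(parts + [dim_name])
        if key in dims:
            return dims[key]
        parts.pop()
    raise KeyError(dim_name)
```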

merger.path_utils.resolve_group(dataset, path)[source]

Resolves a group path into two components: the group and the resource’s name

Parameters
  • dataset (nc.Dataset) – NetCDF4 Dataset used as the root for all groups

  • path (str) – the path to the resource

Returns

a tuple of the resolved group and the final path component str respectively

Return type

tuple
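The split is essentially an `rpartition` on the last path separator, relying on netCDF4 Datasets supporting indexing by group path; a sketch under that assumption:

```python
def resolve_group_sketch(dataset, path):
    # '/a/b/var' -> group at '/a/b' plus final name 'var';
    # a top-level '/var' resolves against the root dataset itself.
    group_path, _, name = path.rpartition('/')
    group = dataset[group_path] if group_path else dataset
    return group, name
```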

merger.preprocess_worker module

Preprocessing methods and the utilities to automagically run them in single-thread/multiprocess modes

merger.preprocess_worker.attr_eq(attr_1, attr_2)[source]

Helper function to check if one attribute value is equal to another (no, a simple == was not working)

Parameters
  • attr_1 (obj) – An attribute value

  • attr_2 (obj) – An attribute value
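The reason a bare `==` fails is that array-valued attributes return an elementwise result, which is ambiguous in a boolean context. A sketch of one way to handle this (the exact comparison rules of the real helper may differ):

```python
import numpy as np

def attr_eq_sketch(attr_1, attr_2):
    # Array attributes need np.array_equal; a bare `==` yields an array.
    if isinstance(attr_1, np.ndarray) or isinstance(attr_2, np.ndarray):
        return np.array_equal(attr_1, attr_2)
    # Type check is an assumption: it prevents e.g. 1 == 1.0 matching.
    return type(attr_1) is type(attr_2) and attr_1 == attr_2
```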

merger.preprocess_worker.construct_history(input_files)[source]

Construct history JSON entry for this concatenation operation https://wiki.earthdata.nasa.gov/display/TRT/In-File+Provenance+Metadata+-+TRT-42

Parameters

input_files (list) – List of input files

Returns

History JSON constructed for this concat operation

Return type

dict
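The shape of such an entry might look like the following. The field names here are illustrative placeholders, not the exact schema; the real structure follows the history_json convention linked above.

```python
from datetime import datetime, timezone

def construct_history_sketch(input_files):
    # Hypothetical field names; consult the TRT-42 convention for the
    # authoritative schema.
    return {
        'date_time': datetime.now(timezone.utc).isoformat(),
        'program': 'merger',
        'parameters': 'input_files={}'.format(input_files),
    }
```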

merger.preprocess_worker.get_max_dims(group, max_dims)[source]

Aggregates dimensions from each group and creates a dictionary of the largest dimension sizes for each group

Parameters
  • group (nc.Dataset, nc.Group) – group to process dimensions from

  • max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes
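The aggregation keeps the largest size seen for each dimension path across granules. A simplified sketch, taking the group's dimensions as a plain name-to-size mapping rather than netCDF4 objects:

```python
def get_max_dims_sketch(group_path, dimensions, max_dims):
    # Record the maximum size observed for each '/group/dim' path.
    for name, size in dimensions.items():
        dim_path = '{}/{}'.format(group_path.rstrip('/'), name)
        max_dims[dim_path] = max(max_dims.get(dim_path, 0), size)
```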

merger.preprocess_worker.get_metadata(group, metadata)[source]

Aggregates metadata from various NetCDF4 objects into a dictionary

Parameters
  • group (nc.Dataset, nc.Group, nc.Variable) – the NetCDF4 object to aggregate metadata from

  • metadata (dict) – a dictionary containing the object name and associated metadata

merger.preprocess_worker.get_variable_data(group, var_info, var_metadata)[source]

Aggregate variable metadata and attributes. Primarily utilized in process_groups

Parameters
  • group (nc.Dataset, nc.Group) – group associated with this variable

  • var_info (dict) – dictionary of variable paths and associated VariableInfo

  • var_metadata (dict) – dictionary of variable paths and associated attribute dictionary

merger.preprocess_worker.merge_max_dims(merged_max_dims, subset_max_dims)[source]

Perform aggregation of max_dims. Intended for use in multithreaded mode only

Parameters
  • merged_max_dims (dict) – Dictionary of the aggregated max_dims

  • subset_max_dims (dict) – Dictionary of max_dims from one of the worker processes
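The aggregation is an in-place elementwise maximum over the two dictionaries; a minimal sketch:

```python
def merge_max_dims_sketch(merged_max_dims, subset_max_dims):
    # Keep the larger size whenever a worker reports a dimension path
    # already present in the aggregate.
    for dim_path, size in subset_max_dims.items():
        if merged_max_dims.get(dim_path, -1) < size:
            merged_max_dims[dim_path] = size
```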

merger.preprocess_worker.merge_metadata(merged_metadata, subset_metadata)[source]

Perform aggregation of metadata. Intended for use in multithreaded mode only

Parameters
  • merged_metadata (dict) – Dictionary of the aggregated metadata

  • subset_metadata (dict) – Dictionary of metadata from one of the worker processes
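One plausible aggregation strategy, sketched below, records each attribute on first sight and flags disagreements between granules so they can be dropped later by clean_metadata. The `False` marker is an assumption for illustration:

```python
def merge_metadata_sketch(merged_metadata, subset_metadata):
    for obj_path, attrs in subset_metadata.items():
        merged_attrs = merged_metadata.setdefault(obj_path, {})
        for name, value in attrs.items():
            if name not in merged_attrs:
                merged_attrs[name] = value
            elif merged_attrs[name] != value:
                # Assumed inconsistency marker, removed downstream.
                merged_attrs[name] = False
```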

merger.preprocess_worker.process_groups(parent_group, group_list, max_dims, group_metadata, var_metadata, var_info)[source]

Perform preprocessing of a group and recursively process each child group

Parameters
  • parent_group (nc.Dataset, nc.Group) – current group to be processed

  • group_list (list) – list of group paths

  • max_dims (dict) – dictionary which stores dimension paths and associated dimension sizes

  • group_metadata (dict) – dictionary which stores group paths and their associated attributes

  • var_metadata (dict) – dictionary of dictionaries which stores variable paths and their associated attributes

  • var_info (dict) – dictionary of variable paths and associated VariableInfo data

merger.preprocess_worker.retrieve_history(dataset)[source]

Retrieve history_json field from NetCDF dataset, if it exists

Parameters

dataset (netCDF4.Dataset) – NetCDF Dataset representing a single granule

Returns

history_json field

Return type

dict

merger.preprocess_worker.run_preprocess(file_list, process_count)[source]

Automagically run preprocessing in an optimized mode determined by the environment

Parameters
  • file_list (list) – List of file paths to be processed

  • process_count (int) – Number of worker processes to run (expected >= 1)

merger.variable_info module

Wrapper used to manage variable metadata

class merger.variable_info.VariableInfo(var)[source]

Bases: object

Lightweight wrapper class utilized in granule preprocessing to simplify comparisons between variables from different granule sets

name

name of the variable

Type

str

dim_order

list of dimension names in order

Type

list

datatype

the numpy datatype for the data held in the variable

Type

numpy.dtype

group_path

Unix-like group path to the variable

Type

str

fill_value

Value used to fill missing/empty values in variable’s data

Type

object

Module contents