Cloud Optimized Data Formats
Faster data access when working in the cloud
Traditional Earth data formats like netCDF and HDF are not optimal for storing and accessing data in the cloud, e.g. for streaming data from the files rather than downloading them directly. They do work, but new file formats are emerging that allow data to be accessed faster. In some cases, future data sets will exist only in these new formats, such as Zarr (with no netCDF or HDF version available). In other cases, technology is emerging that allows data sets to still be written in more traditional formats, but with supplementary files that act as a “road map” to the data, allowing software to navigate the data files more efficiently. Whichever is implemented, a key goal is to minimize the additional technical knowledge a user must acquire just to access the data. For example, the Python package Xarray is already adding new capabilities to its open_mfdataset() function so it works in a similar way whether using netCDF or cloud optimized data. There are many resources for gaining a high-level understanding of cloud optimized data, such as the Cloud-Optimized Geospatial Formats Guide.
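As a rough illustration of how similar the two access patterns can look in Xarray, here is a minimal sketch. The helper names open_netcdf_collection and open_zarr_store are hypothetical and not part of any package, and the imports are deferred into the function bodies only so that the definitions load on systems without Xarray installed. Running the functions assumes xarray, dask, and a netCDF or Zarr backend are available.

```python
def open_netcdf_collection(paths):
    """Lazily open many netCDF files as one dataset (needs xarray and dask)."""
    import xarray as xr  # deferred import: assumption made for this sketch only

    # combine="by_coords" merges the files using their coordinate values.
    return xr.open_mfdataset(paths, combine="by_coords")


def open_zarr_store(store):
    """Lazily open a single Zarr store (needs xarray, zarr, and dask)."""
    import xarray as xr

    # chunks={} requests dask-backed (lazy) arrays using the store's own chunking.
    return xr.open_dataset(store, engine="zarr", chunks={})
```

Both helpers return an xarray.Dataset, so subsetting, computation, and plotting code written against one should work unchanged against the other.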
Zarr
- Tutorial for NetCDF4 Files - Teaches about the Zarr cloud optimized format
Kerchunk and Virtualizarr
Both Kerchunk and Virtualizarr are Python packages that aim to achieve cloud-optimized results while still allowing a data set to be stored in traditional formats (e.g. netCDF, HDF). These packages create relatively small supplementary files that can be used by software to more efficiently navigate through a collection of e.g. netCDF files. This allows much faster lazy loading, faster selection and access to subsets of the data, and in some cases faster computations.
- Kerchunk recipes - Meant to be used after gaining a high-level understanding of the package, this notebook goes through several Kerchunk functionalities that we found relevant to Earthdata users. Workflows here combine Kerchunk with the earthaccess package.
- Kerchunk JSON Generation - An additional tutorial on generating a Kerchunk JSON file, demonstrating its use with one of the SWOT data sets hosted on PO.DAAC. Its output is used as input to the following tutorial.
- Integrating Dask, Kerchunk, Zarr and Xarray - Efficiently visualize a whole collection of data in an interactive dashboard via cloud-optimized formats.
- Virtualizarr (coming soon)
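To make the “road map” idea concrete, here is a small conceptual sketch that uses only the Python standard library. It does not use Kerchunk or Virtualizarr. It mimics the general idea of a reference file, loosely modeled on a key to [file, offset, length] layout. The file name granule.bin, the keys temperature/0 and temperature/1, and the read_chunk helper are all hypothetical. The real packages build their references by scanning the metadata of netCDF or HDF files.

```python
import json
import os
import tempfile

# Stand-in for a traditional data file: a small header followed by two "chunks".
header = b"HEADER--"
chunk_a = bytes(range(0, 16))
chunk_b = bytes(range(16, 32))

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "granule.bin")
with open(data_path, "wb") as f:
    f.write(header + chunk_a + chunk_b)

# The "road map": each chunk key maps to [file, byte offset, byte length].
refs = {
    "temperature/0": [data_path, len(header), len(chunk_a)],
    "temperature/1": [data_path, len(header) + len(chunk_a), len(chunk_b)],
}
refs_path = os.path.join(workdir, "refs.json")
with open(refs_path, "w") as f:
    json.dump({"version": 1, "refs": refs}, f)


def read_chunk(refs_file, key):
    """Read one chunk by seeking straight to its recorded byte range."""
    with open(refs_file) as f:
        path, offset, length = json.load(f)["refs"][key]
    with open(path, "rb") as f:
        f.seek(offset)  # skip everything before the chunk
        return f.read(length)


# Fetch only the second chunk without reading the header or the first chunk.
second = read_chunk(refs_path, "temperature/1")
```

The reference file is tiny compared with the data it describes, and a reader that has it can request just the byte ranges it needs. In the cloud, the same lookup would translate to ranged requests against object storage instead of local seeks.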