VirtualiZarr Useful Recipes with NASA Earthdata

Author: Dean Henze, PO.DAAC

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

Summary

This notebook goes through several functionalities of the VirtualiZarr package to create virtual reference files, specifically using it with NASA Earthdata and utilizing the earthaccess package. It is meant to be a quick-start reference that introduces some key capabilities / characteristics of the package once a user has a high-level understanding of virtual data sets and the cloud-computing challenges they address (see references in the Prerequisite knowledge section below). In short, VirtualiZarr is a Python package to create “reference files”, which can be thought of as road maps for the computer to efficiently navigate through large arrays in a single data file, or across many files. Once a reference file for a data set is created, utilizing it to open the data can speed up several processes including lazy loading, accessing subsets, and in some cases performing computations. Importantly, one can create a combined reference for all the files in a dataset and use it to lazy load / access the entire record at once.

The functionalities of VirtualiZarr (with earthaccess) covered in this notebook are:

Getting data file endpoints in Earthdata Cloud, which are needed for VirtualiZarr to create reference files.
Generating reference files for 1 day, 1 year, and the entire record of a ~750 GB data set. The data set used is the Level 4 global gridded 6-hourly wind product from the Cross-Calibrated Multi-Platform project (https://doi.org/10.5067/CCMP-6HW10M-L4V31), available on PO.DAAC. This section also covers speeding up the reference creation using parallel computing. Reference files are saved in both JSON and PARQUET formats. The latter is an important format as it reduces the reference file size by ~30x in our tests. Saving in Icechunk format will be tested / covered in the coming months.
Combining reference files (in progress). The ability to combine reference files is valuable, for example to update reference files for forward-streaming datasets when new data are available, without re-creating the entire record from scratch. However, with the current workflows and version of VirtualiZarr, this is not possible due to our use of a specific kwarg when creating the reference files. The workflow is still included here (with errors) because it is anticipated that this will be fixed in upcoming versions. Alternatively, the use of Icechunk will also likely solve this issue (Icechunk functionality to be tested soon).

Requirements, prerequisite knowledge, learning outcomes

Requirements to run this notebook

Earthdata login account: An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account.
Compute environment: This notebook is meant to be run in the cloud (AWS instance running in us-west-2). We used an m6i.4xlarge EC2 instance (16 CPUs, 64 GiB memory) for the parallel computing sections. At minimum we recommend a VM with 10 CPUs to make the parallel computations in Section 2.2.1 faster.
Optional Coiled account: To run the section on distributed clusters, create a Coiled account (free to sign up) and connect it to an AWS account. For more information on Coiled, setting up an account, and connecting it to an AWS account, see their website https://www.coiled.io.

Prerequisite knowledge

This notebook covers virtualizarr functionality but does not present the high-level ideas behind it. For an understanding of reference files and how they are meant to enhance in-cloud access to file formats that are not cloud optimized (such as netCDF, HDF), please see e.g. this kerchunk page, or this page on virtualizarr.
Familiarity with the earthaccess and Xarray packages. Familiarity with directly accessing NASA Earthdata in the cloud.
The Cookbook notebook on Dask basics is handy for those new to parallel computing.

Learning Outcomes

This notebook serves both as a pedagogical resource for learning several key workflows as well as a quick reference guide. Readers will gain the understanding to combine the virtualizarr and earthaccess packages to create virtual dataset reference files for NASA Earthdata.

Import Packages

Note Zarr Version

Zarr version 2 is needed for the current implementation of this notebook, due to (as of February 2025) Zarr version 3 not accepting FSMap objects.

We ran this notebook in a Python 3.12 environment. The minimal working environment we used to run this notebook was:

zarr==2.18.4
fastparquet==2024.5.0
xarray==2025.1.2
earthaccess==0.11.0
fsspec==2024.10.0
dask==2024.5.2 ("dask[complete]"==2024.5.2 if using pip)
h5netcdf==1.3.0
matplotlib==3.9.2
jupyterlab
jupyter-server-proxy
virtualizarr==1.2.0
kerchunk==0.2.7

And optionally:

coiled==1.58.0

# Built-in packages
import os
import sys

# Filesystem management 
import fsspec
import earthaccess

# Data handling
import xarray as xr
from virtualizarr import open_virtual_dataset

# Parallel computing 
import multiprocessing
from dask import delayed
import dask.array as da
from dask.distributed import Client

# Other
import matplotlib.pyplot as plt

# Optional
import coiled

Other Setup

xr.set_options( # display options for xarray objects
    display_expand_attrs=False,
    display_expand_coords=True,
    display_expand_data=True,
)

<xarray.core.options.set_options at 0x7f78f44a7aa0>

1. Get Data File S3 endpoints in Earthdata Cloud

The first step is to find the S3 endpoints to the files. Handling access credentials to Earthdata and then finding the endpoints can be done a number of ways (e.g. using the requests, s3fs packages) but we use the earthaccess package for its ease of use. We get the endpoints for all files in the CCMP record.

# Get Earthdata creds
earthaccess.login()

<earthaccess.auth.Auth at 0x7f790c5b4770>

# Get AWS creds. Note that if you spend more than 1 hour in the notebook, you may have to re-run this line!!!
fs = earthaccess.get_s3_filesystem(daac="PODAAC")
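Because the temporary AWS credentials expire after roughly an hour, long sessions need to re-create the filesystem object from time to time. The following is a minimal sketch of one way to automate that, not part of the original workflow: `RefreshingFilesystem` is a hypothetical helper, and the 50-minute default age is an assumption chosen to stay under the one-hour limit mentioned above.

```python
import time
from typing import Any, Callable


class RefreshingFilesystem:
    """Hypothetical helper: rebuild a filesystem object once it exceeds a maximum age."""

    def __init__(
        self,
        factory: Callable[[], Any],
        max_age_s: float = 50 * 60,  # assumed safety margin below the ~1 hour expiry
        clock: Callable[[], float] = time.monotonic,
    ):
        self._factory = factory      # callable that returns a fresh filesystem
        self._max_age_s = max_age_s
        self._clock = clock
        self._fs = None
        self._created_at = 0.0

    def get(self) -> Any:
        """Return the cached filesystem, re-creating it first if it is too old."""
        now = self._clock()
        if self._fs is None or now - self._created_at >= self._max_age_s:
            self._fs = self._factory()
            self._created_at = now
        return self._fs


# Possible use in this notebook (not executed here):
# fs_holder = RefreshingFilesystem(lambda: earthaccess.get_s3_filesystem(daac="PODAAC"))
# fs = fs_holder.get()  # call again before any long-running step
```

Calling `fs_holder.get()` right before each expensive step keeps the credential handling in one place instead of scattering manual re-runs through the notebook.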

# Locate CCMP file information / metadata:
granule_info = earthaccess.search_data(
    short_name="CCMP_WINDS_10M6HR_L4_V3.1",
    )

# Get S3 endpoints for all files:
data_s3links = [g.data_links(access="direct")[0] for g in granule_info]
data_s3links[0:3]

['s3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930102_V03.1_L4.nc',
 's3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930103_V03.1_L4.nc',
 's3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/CCMP_Wind_Analysis_19930105_V03.1_L4.nc']
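The three links printed above jump from 19930103 to 19930105, which suggests that the listing either has gaps or is not strictly one file per calendar day. The sketch below shows one way to read the date stamp from each file name, so that files can be selected by year and missing days can be listed. The helper names (`granule_date`, `links_for_year`, `missing_days`) are hypothetical, and the regular expression assumes the `_YYYYMMDD_` pattern seen in the file names above.

```python
import re
from datetime import date, datetime, timedelta

# Assumes a "_YYYYMMDD_" stamp in the file name, as in the links printed above.
_DATE_RE = re.compile(r"_(\d{8})_")


def granule_date(link: str) -> date:
    """Extract the date stamp from the file-name part of an S3 link."""
    fname = link.rsplit("/", 1)[-1]
    match = _DATE_RE.search(fname)
    if match is None:
        raise ValueError(f"No YYYYMMDD date stamp found in {fname!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def links_for_year(links, year):
    """Return the links whose date stamp falls in `year`, sorted chronologically."""
    return sorted((l for l in links if granule_date(l).year == year), key=granule_date)


def missing_days(links):
    """List calendar days between the first and last date stamp that have no file."""
    present = {granule_date(l) for l in links}
    if not present:
        return []
    day, last = min(present), max(present)
    gaps = []
    while day <= last:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


prefix = "s3://podaac-ops-cumulus-protected/CCMP_WINDS_10M6HR_L4_V3.1/"
example_links = [
    prefix + "CCMP_Wind_Analysis_19930102_V03.1_L4.nc",
    prefix + "CCMP_Wind_Analysis_19930103_V03.1_L4.nc",
    prefix + "CCMP_Wind_Analysis_19930105_V03.1_L4.nc",
]
print(missing_days(example_links))
```

Selecting with `links_for_year(data_s3links, 1993)` rather than a fixed slice length is one way to make "first year" mean a calendar year even when some days are absent.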

2. Generate reference files for 1 day, 1 year, and entire record

2.1 First day

The virtualizarr function to generate reference information is compact. We use it on one file for demonstration.

Important

The kwarg loadable_variables is not mandatory to create a viable reference file, but will become important for rapid lazy loading when working with large combined reference files. Assign to this at minimum the list of 1D coordinate variable names for the data set (additional 1D or scalar vars can also be added). This functionality will be the default in future releases of virtualizarr.

# This will be assigned to 'loadable_variables' and needs to be modified per the specific 
# coord names of the data set:
coord_vars = ["latitude","longitude","time"]
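For data sets where the coordinate names are not known in advance, the list assigned to `loadable_variables` could be derived from the variable dimensions rather than typed by hand. This is a minimal sketch under that assumption: `one_d_or_scalar` is a hypothetical helper, and the example mapping mirrors the CCMP variables used in this notebook.

```python
from typing import Mapping, Sequence


def one_d_or_scalar(var_dims: Mapping[str, Sequence[str]]) -> list[str]:
    """Return names of variables that have at most one dimension."""
    return sorted(name for name, dims in var_dims.items() if len(dims) <= 1)


# With xarray, the mapping can be built from any Dataset `ds` opened from one file:
# var_dims = {name: var.dims for name, var in ds.variables.items()}

# Illustrative mapping matching the CCMP file layout:
example_dims = {
    "latitude": ("latitude",),
    "longitude": ("longitude",),
    "time": ("time",),
    "uwnd": ("time", "latitude", "longitude"),
    "vwnd": ("time", "latitude", "longitude"),
    "ws": ("time", "latitude", "longitude"),
    "nobs": ("time", "latitude", "longitude"),
}
print(one_d_or_scalar(example_dims))  # ['latitude', 'longitude', 'time']
```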

%%time
reader_opts = {"storage_options": fs.storage_options} # S3 filesystem creds from previous section.

# Create reference for the first data file:
virtual_ds_example = open_virtual_dataset(
    data_s3links[0], indexes={}, 
    reader_options=reader_opts, loadable_variables=coord_vars
    )
print(virtual_ds_example)

<xarray.Dataset> Size: 66MB
Dimensions:    (time: 4, latitude: 720, longitude: 1440)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 32B 1993-01-02 ... 1993-01-02T18:00:00
Data variables:
    uwnd       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    vwnd       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    ws         (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
    nobs       (time, latitude, longitude) float32 17MB ManifestArray<shape=(...
Attributes: (54)
CPU times: user 301 ms, sys: 123 ms, total: 424 ms
Wall time: 1.75 s

The reference can be saved to file and used to open the corresponding CCMP data file with Xarray:

virtual_ds_example.virtualize.to_kerchunk('virtual_ds_example.json', format='json')

# Open data using the reference file, using a small wrapper function around xarray's open_dataset. 
# This will shorten code blocks in other sections. 
def opends_withref(ref, fs_data):
    """
    "ref" is a reference file or object. "fs_data" is a filesystem with credentials to
    access the actual data files. 
    """
    storage_opts = {"fo": ref, "remote_protocol": "s3", "remote_options": fs_data.storage_options}
    fs_ref = fsspec.filesystem('reference', **storage_opts)
    m = fs_ref.get_mapper('')
    data = xr.open_dataset(
        m, engine="zarr", chunks={},
        backend_kwargs={"consolidated": False}
    )
    return data

data_example = opends_withref('virtual_ds_example.json', fs)
print(data_example)

<xarray.Dataset> Size: 66MB
Dimensions:    (latitude: 720, longitude: 1440, time: 4)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 32B 1993-01-02 ... 1993-01-02T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 17MB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (54)

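# Also useful to note, these reference objects are lightweight. Keep in mind that sys.getsizeof
# reports only the shallow size of the Dataset object, not the manifests it holds: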
print(sys.getsizeof(virtual_ds_example), "bytes")

120 bytes
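To get a feel for what the saved JSON reference contains, it can help to inspect it directly. In the Kerchunk JSON layout, entries live under a top-level "refs" key, and a chunk that points into a data file is stored as a `[path, offset, length]` list. The sketch below counts those pointers per variable and sums the bytes they reference. `summarize_refs` is a hypothetical helper, and the toy dictionary (bucket name, offsets, lengths) is made up for illustration.

```python
import json
from collections import defaultdict


def summarize_refs(refs: dict) -> dict:
    """Per variable: number of chunk pointers and total bytes they point to."""
    summary = defaultdict(lambda: {"chunks": 0, "bytes": 0})
    for key, value in refs["refs"].items():
        var, _, leaf = key.rpartition("/")
        if not var or leaf.startswith(".z"):
            continue  # group-level or array-level metadata, not a chunk
        if isinstance(value, list) and len(value) == 3:
            summary[var]["chunks"] += 1
            summary[var]["bytes"] += value[2]  # third element is the byte length
    return dict(summary)


# Toy reference with invented paths and byte ranges:
toy_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "uwnd/.zarray": "{}",
        "uwnd/0.0.0": ["s3://example-bucket/file.nc", 1000, 500],
        "uwnd/1.0.0": ["s3://example-bucket/file.nc", 1500, 520],
    },
}
print(summarize_refs(toy_refs))

# For the file saved above (not executed here):
# with open("virtual_ds_example.json") as f:
#     print(summarize_refs(json.load(f)))
```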

2.2 First year

Reference information for each data file in the year is created individually, and then the combined reference file for the year can be created.

For us, reference file creation for a single file takes about 0.7 seconds, so processing a year of files would take about 4.25 minutes. One can easily accomplish this with a for-loop:

virtual_ds_list = [
    open_virtual_dataset(
        p, indexes={},
        reader_options={"storage_options": fs.storage_options},
        loadable_variables=coord_vars
        )
    for p in data_s3links[:365] # First year only!
    ]
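The serial time estimate quoted above is simple arithmetic, and the same arithmetic gives a rough lower bound for a parallel run. The sketch below is only a back-of-the-envelope aid: both function names are hypothetical, and the parallel figure assumes perfectly even work division with no scheduling or network overhead, so real runs will be slower.

```python
import math


def serial_minutes(n_files: int, sec_per_file: float) -> float:
    """Estimated wall time in minutes when files are processed one after another."""
    return n_files * sec_per_file / 60


def ideal_parallel_minutes(n_files: int, sec_per_file: float, n_workers: int) -> float:
    """Idealized lower bound in minutes: each worker handles ceil(n_files / n_workers) files."""
    return math.ceil(n_files / n_workers) * sec_per_file / 60


print(f"serial:   {serial_minutes(365, 0.7):.2f} min")
print(f"parallel: {ideal_parallel_minutes(365, 0.7, 15) * 60:.1f} s (idealized, 15 workers)")
```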

However, we speed things up using basic parallel computing.

2.2.1 Method 1: parallelize using Dask local cluster

If using an m6i.4xlarge AWS EC2 instance, there are 16 CPUs available and each should have enough memory to utilize all at once. If working on a different VM-type, change the n_workers in the call to Client() below as needed.

# Check how many cpu's are on this VM:
print("CPU count =", multiprocessing.cpu_count())

CPU count = 16

# Start up cluster and print some information about it:
client = Client(n_workers=15, threads_per_worker=1)
print(client.cluster)
print("View any work being done on the cluster here", client.dashboard_link)

LocalCluster(cbeb9b3b, 'tcp://127.0.0.1:33393', workers=15, threads=15, memory=60.81 GiB)
View any work being done on the cluster here https://cluster-ykalm.dask.host/jupyter/proxy/8787/status

%%time
# Create individual references:
open_vds_par = delayed(open_virtual_dataset)
tasks = [
    open_vds_par(p, indexes={}, reader_options=reader_opts, loadable_variables=coord_vars) 
    for p in data_s3links[:365] # First year only!
    ]
virtual_ds_list = list(da.compute(*tasks)) # The xr.combine_nested() function below needs a list rather than a tuple.

CPU times: user 5.5 s, sys: 1.14 s, total: 6.64 s
Wall time: 47.6 s

Using the individual references to create the combined reference is fast and does not require parallel computing.

%%time
# Create the combined reference
virtual_ds_combined = xr.combine_nested(virtual_ds_list, concat_dim='time', coords='minimal', compat='override', combine_attrs='drop_conflicts')

CPU times: user 181 ms, sys: 18.1 ms, total: 199 ms
Wall time: 195 ms

# Save in JSON or PARQUET format:
fname_combined_json = 'ref_combined_1year.json'
fname_combined_parq = 'ref_combined_1year.parq'
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_json, format='json')
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_parq, format='parquet')
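To check the size difference between the two formats on your own system, the on-disk sizes can be compared directly. The sketch below assumes the Parquet reference may be written as a directory of files rather than a single file, so it sums sizes recursively. `path_size_bytes` is a hypothetical helper.

```python
import os


def path_size_bytes(path: str) -> int:
    """Size in bytes of a file, or the summed size of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Possible use with the files saved above (not executed here):
# json_bytes = path_size_bytes(fname_combined_json)
# parq_bytes = path_size_bytes(fname_combined_parq)
# print(f"JSON is {json_bytes / parq_bytes:.1f}x the size of PARQUET")
```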

%%time
# Test lazy loading of the combined reference file JSON:
data_json = opends_withref(fname_combined_json, fs)
print(data_json)

<xarray.Dataset> Size: 24GB
Dimensions:    (latitude: 720, longitude: 1440, time: 1460)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 12kB 1993-01-02 ... 1994-01-04T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (47)
CPU times: user 33.6 ms, sys: 0 ns, total: 33.6 ms
Wall time: 32.5 ms

%%time
# Test lazy loading of the combined reference file PARQUET:
data_parq = opends_withref(fname_combined_parq, fs)
print(data_parq)

<xarray.Dataset> Size: 24GB
Dimensions:    (latitude: 720, longitude: 1440, time: 1460)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 12kB 1993-01-02 ... 1994-01-04T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (47)
CPU times: user 27.2 ms, sys: 0 ns, total: 27.2 ms
Wall time: 24.7 ms

2.2.2 Optional method 2: parallelize using distributed cluster with Coiled

At PO.DAAC we have been testing the third party software/package Coiled which makes it easy to spin up distributed computing clusters in the cloud. Since we suspect that Coiled may become a key member of the Cloud ecosystem for earth science researchers, this optional section is included, which can be used as an alternative to Section 2.2.1 for generating the individual reference files in parallel.

%%time

## --------------------------------------------
## Create single reference files with parallel computing using Coiled
## --------------------------------------------

# Wrap `open_virtual_dataset()` into coiled function and copy to multiple VMs:
open_vds_par = coiled.function(
    region="us-west-2", spot_policy="on-demand", 
    vm_type="m6i.large", n_workers=15
    )(open_virtual_dataset)

# Begin computations for first year only:
results = open_vds_par.map(
    data_s3links[:365], indexes={}, 
    reader_options=reader_opts, loadable_variables=coord_vars
    )

virtual_ds_list = []
for r in results:
    virtual_ds_list.append(r)

CPU times: user 2.6 s, sys: 135 ms, total: 2.73 s
Wall time: 2min 15s

open_vds_par.cluster.shutdown()

Using the individual references to create the combined reference is fast and does not require parallel computing.

%%time
# Combining the individual references works the same as in Section 2.2.1:
virtual_ds_combined = xr.combine_nested(virtual_ds_list, concat_dim='time', coords='minimal', compat='override', combine_attrs='drop_conflicts')

CPU times: user 176 ms, sys: 0 ns, total: 176 ms
Wall time: 176 ms

# Save in JSON or PARQUET format:
fname_combined_json = 'ref_combined_1year.json'
fname_combined_parq = 'ref_combined_1year.parq'
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_json, format='json')
virtual_ds_combined.virtualize.to_kerchunk(fname_combined_parq, format='parquet')

%%time
# Test lazy loading of the combined reference file JSON:
data_json = opends_withref(fname_combined_json, fs)
print(data_json)

<xarray.Dataset> Size: 24GB
Dimensions:    (latitude: 720, longitude: 1440, time: 1460)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 12kB 1993-01-02 ... 1994-01-04T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (47)
CPU times: user 20.2 ms, sys: 4.06 ms, total: 24.2 ms
Wall time: 23.8 ms

%%time
# Test lazy loading of the combined reference file PARQUET:
data_parq = opends_withref(fname_combined_parq, fs)
print(data_parq)

<xarray.Dataset> Size: 24GB
Dimensions:    (latitude: 720, longitude: 1440, time: 1460)
Coordinates:
  * latitude   (latitude) float32 3kB -89.88 -89.62 -89.38 ... 89.38 89.62 89.88
  * longitude  (longitude) float32 6kB 0.125 0.375 0.625 ... 359.4 359.6 359.9
  * time       (time) datetime64[ns] 12kB 1993-01-02 ... 1994-01-04T18:00:00
Data variables:
    nobs       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    uwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    vwnd       (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
    ws         (time, latitude, longitude) float32 6GB dask.array<chunksize=(1, 720, 1440), meta=np.ndarray>
Attributes: (47)
CPU times: user 303 ms, sys: 12 ms, total: 315 ms
Wall time: 314 ms

2.3 Entire record

Processing the entire record follows the exact same workflow as processing the first year in Section 2.2 (either parallelization method). The only modification required is to replace the one instance of

data_s3links[:365]

with

data_s3links[:]

when setting up the parallel computations (occurs once in each of Sections 2.2.1 and 2.2.2). Optionally, also change the saved file names e.g. from ref_combined_1year.json to ref_combined_record.json.

For us, processing the entire record using a local cluster on an m6i.4xlarge EC2 instance, with 15 workers, took about 13 minutes. Using 20 m6i.large VMs on a distributed cluster with Coiled took ~15 minutes and cost ~$0.40.

Because the virtualizarr package is so efficient at combining many individual reference files together, and because the individual references have such small in-memory requirements, the workflows in Section 2.2 are assumed to scale to tens of thousands of files and TBs of data. However, this assumption will be tested as the techniques in the notebook are applied to progressively larger data sets.

For us, lazy loading the entire record took ~3 seconds. Compare that to an attempt at opening these same files with Xarray the “traditional” way with a call to xr.open_mfdataset(). On a smaller machine, the following line of code will either fail or take a long (possibly very long) amount of time:

## You can try un-commenting and running this but your notebook will probably stall or crash:
# fobjs = earthaccess.open(granule_info)
# data = xr.open_mfdataset(fobjs[:])

3. Appending additional reference files

! Currently this is not viable since the loadable_variables kwarg was used when creating the individual reference files !

Using the loadable_variables kwarg is important for faster lazy loading of large data sets with combined reference files, but does have this current limitation. The issue is that in order to append an additional reference file to our already saved year-long reference file from the previous section, we need to be able to re-load that reference file back into memory as manifest arrays. This isn’t supported yet for files created with the loadable_variables kwarg.

For example, this is how we would append an extra day to the year-long reference file from Section 2.2:

# In case this notebook has been running over an hour, refresh the file system and credentials:
fs = earthaccess.get_s3_filesystem(daac="PODAAC")
reader_opts = {"storage_options": fs.storage_options}

%%time
# Create reference file for 366th CCMP file (index 365, since the year-long
# reference was built from data_s3links[:365]):
vds_extraday = open_virtual_dataset(
    data_s3links[365], indexes={}, 
    reader_options=reader_opts, loadable_variables=coord_vars
    )

CPU times: user 319 ms, sys: 98 ms, total: 417 ms
Wall time: 1.72 s

%%time
# Try to add it to the year-long reference:
vds_year1 = open_virtual_dataset('ref_combined_1year.json', filetype='kerchunk')
vds_appended = xr.combine_nested([vds_year1, vds_extraday], concat_dim='time', coords='minimal', compat='override', combine_attrs='drop_conflicts')

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
File <timed exec>:2

File /opt/coiled/env/lib/python3.12/site-packages/virtualizarr/backend.py:199, in open_virtual_dataset(filepath, filetype, group, drop_variables, loadable_variables, decode_times, cftime_variables, indexes, virtual_array_class, virtual_backend_kwargs, reader_options, backend)
    196 if backend_cls is None:
    197     raise NotImplementedError(f"Unsupported file type: {filetype.name}")
--> 199 vds = backend_cls.open_virtual_dataset(
    200     filepath,
    201     group=group,
    202     drop_variables=drop_variables,
    203     loadable_variables=loadable_variables,
    204     decode_times=decode_times,
    205     indexes=indexes,
    206     virtual_backend_kwargs=virtual_backend_kwargs,
    207     reader_options=reader_options,
    208 )
    210 return vds

File /opt/coiled/env/lib/python3.12/site-packages/virtualizarr/readers/kerchunk.py:75, in KerchunkVirtualBackend.open_virtual_dataset(filepath, group, drop_variables, loadable_variables, decode_times, indexes, virtual_backend_kwargs, reader_options)
     72     with fs.open_file() as of:
     73         refs = ujson.load(of)
---> 75     vds = dataset_from_kerchunk_refs(KerchunkStoreRefs(refs), fs_root=fs_root)
     77 else:
     78     raise ValueError(
     79         "The input Kerchunk reference did not seem to be in Kerchunk's JSON or Parquet spec: https://fsspec.github.io/kerchunk/spec.html. If your Kerchunk generated references are saved in parquet format, make sure the file extension is `.parquet`. The Kerchunk format autodetection is quite flaky, so if your reference matches the Kerchunk spec feel free to open an issue: https://github.com/zarr-developers/VirtualiZarr/issues"
     80     )

File /opt/coiled/env/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:136, in dataset_from_kerchunk_refs(refs, drop_variables, virtual_array_class, indexes, fs_root)
    119 def dataset_from_kerchunk_refs(
    120     refs: KerchunkStoreRefs,
    121     drop_variables: list[str] = [],
   (...)    124     fs_root: str | None = None,
    125 ) -> Dataset:
    126     """
    127     Translate a store-level kerchunk reference dict into an xarray Dataset containing virtualized arrays.
    128 
   (...)    133         Currently can only be ManifestArray, but once VirtualZarrArray is implemented the default should be changed to that.
    134     """
--> 136     vars = virtual_vars_from_kerchunk_refs(
    137         refs, drop_variables, virtual_array_class, fs_root=fs_root
    138     )
    139     ds_attrs = fully_decode_arr_refs(refs["refs"]).get(".zattrs", {})
    140     coord_names = ds_attrs.pop("coordinates", [])

File /opt/coiled/env/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:111, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class, fs_root)
    105     drop_variables = []
    106 var_names_to_keep = [
    107     var_name for var_name in var_names if var_name not in drop_variables
    108 ]
    110 vars = {
--> 111     var_name: variable_from_kerchunk_refs(
    112         refs, var_name, virtual_array_class, fs_root=fs_root
    113     )
    114     for var_name in var_names_to_keep
    115 }
    116 return vars

File /opt/coiled/env/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:169, in variable_from_kerchunk_refs(refs, var_name, virtual_array_class, fs_root)
    167 dims = zattrs.pop("_ARRAY_DIMENSIONS")
    168 if chunk_dict:
--> 169     manifest = manifest_from_kerchunk_chunk_dict(chunk_dict, fs_root=fs_root)
    170     varr = virtual_array_class(zarray=zarray, chunkmanifest=manifest)
    171 elif len(zarray.shape) != 0:
    172     # empty variables don't have physical chunks, but zarray shows that the variable
    173     # is at least 1D

File /opt/coiled/env/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py:195, in manifest_from_kerchunk_chunk_dict(kerchunk_chunk_dict, fs_root)
    193 for k, v in kerchunk_chunk_dict.items():
    194     if isinstance(v, (str, bytes)):
--> 195         raise NotImplementedError(
    196             "Reading inlined reference data is currently not supported. [ToDo]"
    197         )
    198     elif not isinstance(v, (tuple, list)):
    199         raise TypeError(f"Unexpected type {type(v)} for chunk value: {v}")

NotImplementedError: Reading inlined reference data is currently not supported. [ToDo]
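The traceback shows that the error is raised when a chunk entry in the reference is a string or bytes value (inlined data) instead of a list pointing into a data file. To see which variables in a saved JSON reference are affected, the chunk entries can be scanned directly. The sketch below is a diagnostic aid only, not a fix: `inlined_variables` is a hypothetical helper, the toy dictionary is invented, and the expectation that the variables passed to `loadable_variables` are the inlined ones is an assumption to verify on your own file.

```python
import json


def inlined_variables(refs: dict) -> list[str]:
    """Return names of variables with at least one inlined (str or bytes) chunk entry."""
    names = set()
    for key, value in refs["refs"].items():
        var, _, leaf = key.rpartition("/")
        if not var or leaf.startswith(".z"):
            continue  # skip group-level and array-level metadata
        if isinstance(value, (str, bytes)):
            names.add(var)
    return sorted(names)


# Toy reference with one inlined variable and one pointer-based variable:
toy_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "time/.zarray": "{}",
        "time/0": "base64:AAAAAAAAAAA=",
        "uwnd/.zarray": "{}",
        "uwnd/0.0.0": ["s3://example-bucket/file.nc", 1000, 500],
    },
}
print(inlined_variables(toy_refs))  # ['time']

# For the year-long reference saved earlier (not executed here):
# with open("ref_combined_1year.json") as f:
#     print(inlined_variables(json.load(f)))
```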