import matplotlib.pyplot as plt
import xarray as xr
import pandas as pd
import numpy as np
import requests
import json
import time
import s3fs
= "ECCO_L4_OBP_05DEG_MONTHLY_V4R4" ShortName
COF Zarr Access via Reformat
imported on: 2024-10-30
This notebook is from a different repository in NASA’s PO.DAAC, ECCO.
The original source for this document is https://github.com/podaac/ECCO/blob/main/Data_Access/cloud_harmony_zarr_reformat.ipynb
Access ECCO data via Harmony & the Zarr reformatter service
This notebook leverages the Zarr reformatter service (available through Harmony API) to access ocean bottom pressure (OBP) data from ECCO V4r4 in Zarr format (instead of native netCDF4 file format).
Getting Started
We will access monthly ocean bottom pressure (OBP) data from ECCO V4r4 (10.5067/ECG5M-OBP44), which are provided as a monthly time series on a 0.5-degree latitude/longitude grid.
The data are archived in netCDF format. However, this notebook demonstration will request conversion to Zarr format for files covering the period between 2010 and 2018. Upon receiving our request, Harmony’s backend will convert the files and stage them in S3 for native access in AWS (us-west-2
region, specifically). We will access the new Zarr datasets as an aggregated dataset using xarray, and leverage the S3 native protocols for direct access to the data in an efficient manner.
Requirements
AWS
This notebook should be running in an EC2 instance in AWS region us-west-2, as previously mentioned. We recommend using an EC2 with at least 8GB of memory available.
The notebook was developed and tested using a t2.large instance (2 cpus; 8GB memory).
Python 3
Most of these imports are from the Python standard library. However, you will need to install these packages into your Python 3 environment if you have not already done so:
- s3fs
- requests
- pandas
- xarray
- matplotlib
Requirements
Study period
Set some “master” inputs to define the time and place contexts for our case studies in the ipynb. This example will be requesting time subsets and receiving global data back from the Harmony API.
= "2010-01-01"
start_date = "2018-12-31" end_date
Data Access
Some features in the Harmony API require us to identify the target dataset/collection by its concept-id (which uniquely idenfifies it among the other datasets in the Common Metadata Repository). Support for selection by the dataset ShortName will be added in a future release.
Common Metadata Repository (CMR)
For now, we will need to get the concept-id that corresponds to our dataset by accessing its metadata from the CMR. Read more about the CMR at: https://cmr.earthdata.nasa.gov/
Request the UMM Collection metadata (i.e. metadata about the dataset) from the CMR and select the concept-id as a new variable ccid.
= requests.get(
response ='https://cmr.earthdata.nasa.gov/search/collections.umm_json',
url={'provider': "POCLOUD",
params'ShortName': ShortName,
'page_size': 1}
)
= response.json()['items'][0]
ummc
= ummc['meta']['concept-id']
ccid
ccid
'C1990404791-POCLOUD'
Harmony API
And get the Harmony API endpoint and zarr parameter like we did for SMAP before:
= f"https://harmony.earthdata.nasa.gov/{ccid}"
base = f"{base}/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset"
hreq = f"{hreq}?format=application/x-zarr"
rurl
print(rurl)
https://harmony.earthdata.nasa.gov/C1990404791-POCLOUD/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?format=application/x-zarr
ECCO monthly collections have 312 granules in V4r4 (you can confirm with the granule listing from CMR Search API) so we can get the entire time series for 2010 to 2018 with one request to the Harmony API.
Format a string of query parameters to limit the processing to the desired time period. Then, append the string of time subset parameters to the variable rurl
.
= '&'.join([f'subset=time("{start_date}T00:00:00.000Z":"{end_date}T23:59:59.999Z")'])
subs
= f"{rurl}&{subs}"
rurl
print(rurl)
https://harmony.earthdata.nasa.gov/C1990404791-POCLOUD/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?format=application/x-zarr&subset=time("2010-01-01T00:00:00.000Z":"2018-12-31T23:59:59.999Z")
Submit the request and monitor the processing status in a while loop, breaking it on completion of the request job:
= requests.get(url=rurl).json()
response
# Monitor status in a while loop. Wait 10 seconds for each check.
= 10
wait while True:
= requests.get(url=response['links'][0]['href']).json()
response if response['status']!='running':
break
print(f"Job in progress ({response['progress']}%)")
time.sleep(wait)
print("DONE!")
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
Job in progress (0%)
DONE!
Access the staged cloud datasets over native AWS interfaces
Check the message field in the response for clues about how to proceed:
print(response['message'])
The job has completed successfully. Contains results in AWS S3. Access from AWS us-west-2 with keys from https://harmony.earthdata.nasa.gov/cloud-access.sh
The third item in the list of links contains the shell script from the job status message printed above. Let’s download the same information in JSON format. It should be the fourth item; check to be sure:
len(response['links'])
102
Select the url and download the json, then load to Python dictionary and print the keys:
with requests.get(response['links'][3]['href']) as r:
= r.json()
creds
print(creds.keys())
dict_keys(['AccessKeyId', 'SecretAccessKey', 'SessionToken', 'Expiration'])
Check the expiration timestamp for the temporary credentials:
'Expiration'] creds[
'2021-06-11T02:36:29.000Z'
Open zarr datasets with s3fs and xarray
Get the s3 output directory and list of zarr datasets from the list of links. The s3 directory should be the fifth item; the urls are from item six onward:
= response['links'][4]['href']
s3_dir
print(s3_dir)
s3://harmony-prod-staging/public/harmony/netcdf-to-zarr/2295236b-8086-4543-9482-f524a9f2d0c3/
Now select the URLs for the staged files and print the first one:
= [u['href'] for u in response['links'][5:]]
s3_urls
print(s3_urls[0])
s3://harmony-prod-staging/public/harmony/netcdf-to-zarr/2295236b-8086-4543-9482-f524a9f2d0c3/OCEAN_BOTTOM_PRESSURE_mon_mean_2009-12_ECCO_V4r4_latlon_0p50deg.zarr
Use the AWS s3fs package and your temporary aws_creds
to open the zarr directory storage:
= s3fs.S3FileSystem(
s3 =creds['AccessKeyId'],
key=creds['SecretAccessKey'],
secret=creds['SessionToken'],
token={'region_name':'us-west-2'},
client_kwargs
)
len(s3.ls(s3_dir))
97
Plot the first Ocean Bottom Pressure dataset
Check out the documentation for xarray’s open_zarr method at this link. Open the first dataset and plot the OBP variable:
= xr.open_zarr(s3.get_mapper(s3_urls[0]), decode_cf=True, mask_and_scale=True)
ds0
# Mask the dataset where OBP is not within the bounds of the variable's valid min/max:
= ds0.where((ds0.OBP>=ds0.OBP.valid_min) & (ds0.OBP<=ds0.OBP.valid_max))
ds0_masked
# Plot the masked dataset
=0).plot.imshow(size=10) ds0_masked.OBP.isel(time
Load the zarr datasets into one large xarray dataset
Load all the datasets in a loop and concatenate them:
= xr.concat([xr.open_zarr(s3.get_mapper(u)) for u in s3_urls], dim="time")
zds
print(zds)
<xarray.Dataset>
Dimensions: (latitude: 360, longitude: 720, nv: 2, time: 97)
Coordinates:
* latitude (latitude) float64 -89.75 -89.25 -88.75 ... 89.25 89.75
latitude_bnds (latitude, nv) float64 -90.0 -89.5 -89.5 ... 89.5 89.5 90.0
* longitude (longitude) float64 -179.8 -179.2 -178.8 ... 179.2 179.8
longitude_bnds (longitude, nv) float64 -180.0 -179.5 -179.5 ... 179.5 180.0
* time (time) datetime64[ns] 2009-12-16T12:00:00 ... 2017-12-16T...
time_bnds (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
Dimensions without coordinates: nv
Data variables:
OBP (time, latitude, longitude) float64 dask.array<chunksize=(1, 360, 720), meta=np.ndarray>
OBPGMAP (time, latitude, longitude) float64 dask.array<chunksize=(1, 360, 720), meta=np.ndarray>
Attributes: (12/57)
Conventions: CF-1.8, ACDD-1.3
acknowledgement: This research was carried out by the Jet Pr...
author: Ian Fenty and Ou Wang
cdm_data_type: Grid
comment: Fields provided on a regular lat-lon grid. ...
coordinates_comment: Note: the global 'coordinates' attribute de...
... ...
time_coverage_duration: P1M
time_coverage_end: 2010-01-01T00:00:00
time_coverage_resolution: P1M
time_coverage_start: 2009-12-01T00:00:00
title: ECCO Ocean Bottom Pressure - Monthly Mean 0...
uuid: 297c8df0-4158-11eb-b208-0cc47a3f687b
Reference OBP and mask the dataset according to the valid minimum and maximum:
Get the valid min and max from the corresponding CF attributes:
= obp.valid_min, obp.valid_max
obp_vmin, obp_vmax
obp_vmin, obp_vmax
(-1.7899188995361328, 72.07011413574219)
Mask the dataset according to the OBP min and max and plot a series:
# Mask dataset where not inside OBP variable valid min/max:
= zds.where((obp>=obp_vmin)&(obp<=obp_vmax))
zds_masked
# Plot SSH again for the first 12 time slices:
= zds_masked.OBP.isel(time=slice(0, 6)).plot(
obpp ="longitude",
x="latitude",
y="time",
col=8,
levels=3,
col_wrap=False,
add_colorbar=(14, 8)
figsize
)
# Plot a colorbar on a secondary axis
= obpp.axes[0][0].collections[0]
mappable = plt.axes([0.05, -0.04, 0.95, 0.04])
cax = plt.colorbar(mappable, cax=cax, orientation='horizontal') cbar1