How to Access Data Directly in Cloud (netCDF)
imported on: 2023-03-02
This notebook is from the NASA Openscapes 2021 Cloud Hackathon repository.
The original source for this document is https://github.com/NASA-Openscapes/2021-Cloud-Workshop-AGU/blob/main/how-tos/Multi-File_Direct_S3_Access_NetCDF_Example.ipynb
Accessing Multiple NetCDF4/HDF5 Files - S3 Direct Access
Summary
In this notebook, we will access monthly sea surface height from ECCO V4r4 (10.5067/ECG5D-SSH44). The data are provided as a time series of monthly netCDFs on a 0.5-degree latitude/longitude grid.
We will access the data from inside the AWS cloud (us-west-2 region, specifically) and load a time series made of multiple netCDF datasets into an xarray dataset. This approach leverages S3-native protocols for efficient access to the data.
Requirements
1. AWS instance running in us-west-2
NASA Earthdata Cloud data in S3 can be directly accessed via temporary credentials; this access is limited to requests made within the US West (Oregon) (code: us-west-2) AWS region.
2. Earthdata Login
An Earthdata Login account is required to access data, as well as to discover restricted data, from the NASA Earthdata system. Thus, to access NASA data, you need an Earthdata Login. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. This account is free to create and only takes a moment to set up.
3. netrc File
You will need a netrc file containing your NASA Earthdata Login credentials in order to execute the notebooks. A netrc file can be created manually within a text editor and saved to your home directory. For additional information see: Authentication for NASA Earthdata.
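If you want to confirm your netrc setup before running the rest of the notebook, here is a minimal sketch using Python's standard-library netrc module; it only checks that an entry for urs.earthdata.nasa.gov exists in ~/.netrc and does not validate the password. (This matters because requests picks up netrc credentials automatically, which is how the request to the temporary-credentials endpoint below is authenticated.)
import netrc

# Sketch: verify a ~/.netrc entry exists for Earthdata Login.
try:
    auth = netrc.netrc().authenticators('urs.earthdata.nasa.gov')
    if auth is None:
        print('No urs.earthdata.nasa.gov entry found in ~/.netrc')
    else:
        print(f'Found Earthdata Login entry for user: {auth[0]}')
except FileNotFoundError:
    print('No ~/.netrc file found in your home directory')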
Learning Objectives
- how to retrieve temporary S3 credentials for in-region direct S3 bucket access
- how to define a dataset of interest and find netCDF files in S3 bucket
- how to perform in-region direct access of ECCO_L4_SSH_05DEG_MONTHLY_V4R4 data in S3
- how to plot the data
Import Packages
import os
import requests
import s3fs
import xarray as xr
import hvplot.xarray
Get Temporary AWS Credentials
Direct S3 access is achieved by passing NASA-supplied temporary credentials to AWS so we can interact with S3 objects from applicable Earthdata Cloud buckets. For now, each NASA DAAC has its own AWS credentials endpoint. Below are some of the credential endpoints for various DAACs:
s3_cred_endpoint = {
    'podaac': 'https://archive.podaac.earthdata.nasa.gov/s3credentials',
    'gesdisc': 'https://data.gesdisc.earthdata.nasa.gov/s3credentials',
    'lpdaac': 'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials',
    'ornldaac': 'https://data.ornldaac.earthdata.nasa.gov/s3credentials',
    'ghrcdaac': 'https://data.ghrc.earthdata.nasa.gov/s3credentials'
}
Create a function to make a request to an endpoint for temporary credentials. Remember, each DAAC has its own endpoint, and credentials issued by one DAAC are not usable for cloud data distributed by other DAACs.
def get_temp_creds(provider):
    return requests.get(s3_cred_endpoint[provider]).json()

temp_creds_req = get_temp_creds('podaac')
#temp_creds_req
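These temporary credentials are short-lived (on the order of an hour), so long-running workflows need to refresh them and recreate any clients that use them. A small sketch is below; it assumes the JSON response includes an expiration field alongside accessKeyId, secretAccessKey, and sessionToken, which these endpoints typically return.
# Sketch: check when the temporary credentials expire (assumes an
# 'expiration' field in the response). Once expired, call
# get_temp_creds() again and rebuild the s3fs filesystem below.
print('Credentials expire at:', temp_creds_req.get('expiration'))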
Set up an s3fs session for Direct Access
s3fs sessions are used for authenticated access to the S3 bucket and allow for typical file-system style operations. Below we create a session by passing in the temporary credentials we received from the temporary credentials endpoint.
fs_s3 = s3fs.S3FileSystem(anon=False,
                          key=temp_creds_req['accessKeyId'],
                          secret=temp_creds_req['secretAccessKey'],
                          token=temp_creds_req['sessionToken'],
                          client_kwargs={'region_name': 'us-west-2'})
In this example we’re interested in the ECCO data collection from NASA’s PO.DAAC in Earthdata Cloud. In this case it’s the following string that uniquely identifies the collection of monthly, 0.5-degree sea surface height data (ECCO_L4_SSH_05DEG_MONTHLY_V4R4).
short_name = 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4'
bucket = os.path.join('podaac-ops-cumulus-protected/', short_name, '*2015*.nc')
bucket
Get a list of netCDF files located at the S3 path corresponding to the ECCO V4r4 monthly sea surface height dataset on the 0.5-degree latitude/longitude grid, for year 2015.
ssh_files = fs_s3.glob(bucket)
ssh_files
Direct In-region Access
Open the netCDF files using the s3fs package, then load them all at once into a concatenated xarray dataset.
fileset = [fs_s3.open(file) for file in ssh_files]
Create an xarray dataset using the open_mfdataset() function to “read in” all of the netCDF4 files in one call.
ssh_ds = xr.open_mfdataset(fileset,
                           combine='by_coords',
                           mask_and_scale=True,
                           decode_cf=True,
                           chunks='auto')
ssh_ds
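Because we passed chunks='auto', the variables in ssh_ds are lazy, dask-backed arrays: metadata is read, but values are not pulled from S3 until a computation asks for them. As an illustrative sketch (the latitude/longitude names match the plotting call below; the monthly time dimension is assumed to be named time), the following computes the 2015 time-mean SSH field, which triggers the actual reads:
# Sketch: trigger the lazy S3 reads by computing a time-mean SSH field.
# Assumes the time dimension is named 'time'.
ssh_mean_2015 = ssh_ds.SSH.mean(dim='time').compute()
ssh_mean_2015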
Get the SSH variable as an xarray dataarray.
ssh_da = ssh_ds.SSH
ssh_da
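The plotting call below sets its color limits from the valid_min and valid_max attributes of the SSH variable; it indexes element [0] of each, which suggests they are stored as single-element arrays. A quick look before plotting:
# Inspect the attributes used for the color limits in the plot below.
print('valid_min:', ssh_da.attrs['valid_min'])
print('valid_max:', ssh_da.attrs['valid_max'])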
Plot the SSH time series using hvplot.
ssh_da.hvplot.image(y='latitude', x='longitude', cmap='Viridis').opts(clim=(ssh_da.attrs['valid_min'][0], ssh_da.attrs['valid_max'][0]))
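If you prefer a quick static view of a single month (or hvplot is not available), a minimal sketch using xarray's built-in matplotlib plotting is below; it assumes the time dimension is named time and that matplotlib is installed.
# Sketch: static matplotlib plot of the first monthly time step,
# assuming a 'time' dimension on the DataArray.
ssh_da.isel(time=0).plot(x='longitude', y='latitude', cmap='viridis')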
Resources
Direct access to ECCO data in S3 (from us-west-2)
Data_Access__Direct_S3_Access__PODAAC_ECCO_SSH using CMR-STAC API to retrieve S3 links