import requests
from os import makedirs
from os.path import isdir, basename
from urllib.parse import urlencode
from urllib.request import urlopen, urlretrieve
from datetime import datetime, timedelta
from json import dumps, loads
import earthaccess
from earthaccess import Auth, DataCollections, DataGranules, Store
From the PO.DAAC Cookbook, to access the GitHub version of the notebook, follow this link.
Access Sentinel-6 Data by Cycle and Pass Number
This notebook shows a simple way to search for Sentinel-6 data granules for a specific cycle and pass using the CMR Search API and download them to a local directory.
Before you start
Before you begin this tutorial, make sure you have an Earthdata account: https://urs.earthdata.nasa.gov.
Accounts are free to create and take just a moment to set up.
Import Libraries
Authentication with earthaccess
In this notebook, we authenticate with earthaccess in the cell below.
auth = earthaccess.login(strategy="interactive", persist=True)
Find granules by cycle/pass number
The CMR Search API supports searching ingested granules by their cycle and pass numbers. A third parameter, the tile identifier, is provisioned for use during the upcoming SWOT mission but isn’t used by CMR Search at this time. Read more about these orbit identifiers here.
Passes within a cycle are unique; there will be no repeats until the next cycle. Tile numbers are only unique within a pass, so if you’re looking only at tile numbers there will be over 300 per cycle, but only 1 per pass.
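The mapping from cycle and pass numbers to CMR query parameters can be sketched as follows. The helper name cycle_pass_params is hypothetical, and the indexed passes[i][pass] keys for more than one pass are an assumption extrapolated from the passes[0][pass] key this notebook uses; check them against the CMR Search documentation before relying on them.

```python
from urllib.parse import urlencode


def cycle_pass_params(cycle, passes):
    """Build CMR granule-search parameters for one cycle and one or more passes.

    Hypothetical helper. The 'passes[i][pass]' keys for i > 0 are an assumption.
    """
    params = {"cycle": cycle}
    for i, pass_number in enumerate(passes):
        params[f"passes[{i}][pass]"] = pass_number
    return params


single = cycle_pass_params(25, [1])
print(single)
# urlencode percent-encodes the square brackets in the key
print(urlencode(single))
```

The resulting dictionary can be merged into the other search parameters (collection, page size, sort key) before the query string is built.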
Info below may only apply to NRT use case:
This workflow/notebook can be run routinely to maintain a time series of NRT data, downloading new granules as they become available in CMR.
The notebook writes/overwrites a file .update to the target data directory with each successful run. The file tracks the date and time of the most recent update to the time series of NRT granules using a timestamp in the format yyyy-mm-ddThh:mm:ssZ. The timestamp matches the value used for the created_at parameter in the last successful run. This parameter finds the granules created within a range of datetimes. This workflow leverages the created_at parameter to search backwards in time for new granules ingested between the time of our timestamp and now.
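The timestamp bookkeeping described above can be sketched with two small helpers. The names read_last_update and write_last_update are hypothetical, and the ten-minute fallback is an assumed default for a first run with no .update file; neither is part of earthaccess or CMR.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # yyyy-mm-ddThh:mm:ssZ


def read_last_update(data_dir, fallback_minutes=10):
    """Return the timestamp stored in <data_dir>/.update.

    If the file does not exist, return a UTC timestamp `fallback_minutes`
    before now (an assumed default for a first run).
    """
    update_file = Path(data_dir) / ".update"
    try:
        return update_file.read_text().strip()
    except FileNotFoundError:
        start = datetime.now(timezone.utc) - timedelta(minutes=fallback_minutes)
        return start.strftime(TIMESTAMP_FORMAT)


def write_last_update(data_dir, when=None):
    """Overwrite <data_dir>/.update with `when` (default: now, in UTC)."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime(TIMESTAMP_FORMAT)
    Path(data_dir, ".update").write_text(stamp)
    return stamp
```

Calling read_last_update before the search and write_last_update only after every download succeeds keeps the stored timestamp aligned with the last complete run.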
The variables in the cell below determine the workflow behavior on its initial run:
- trackcycle and trackpass: Set the cycle and pass numbers to use for the CMR granule search.
- cmr: The domain of the target CMR instance, such as cmr.earthdata.nasa.gov.
- ccid: The unique CMR concept-id of the desired collection.
- data: The path to a local directory in which to download/maintain a copy of the NRT granule time series.
cmr = "cmr.earthdata.nasa.gov"

# this function returns the first collection record matching the search
# parameters; the concept id is read from its "meta" field
def get_collection(url: str=f"https://{cmr}/search/collections.umm_json", **params):
    return requests.get(url, params).json().get("items")[0]
#
# This cell accepts parameters from command line with papermill:
# https://papermill.readthedocs.io
#
# These variables should be set before the first run, then they
# should be left alone. All subsequent runs expect the values
# for cmr, ccid, data to be unchanged. The mins value has no
# impact on subsequent runs.
#
trackcycle = 25
trackpass = 1

name = "JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F"

ccid = get_collection(ShortName=name).get("meta").get("concept-id")

data = "resources/trackcycle"
The variable data points to a nearby folder, resources/trackcycle, by default. You should change data to a suitable download path on your file system. An unlucky sequence of git commands could delete that folder and its downloads if you’re not careful.
The search retrieves granules ingested during the last n minutes. If a file that tracks updates exists in your local data directory, its timestamp sets the start of that window; otherwise the CMR search falls back on the ten-minute window.
#timestamp = (datetime.utcnow()-timedelta(minutes=mins)).strftime("%Y-%m-%dT%H:%M:%SZ")
#timestamp
This cell will replace the timestamp above with the one read from the .update file in the data directory, if it exists.
if not isdir(data):
    print(f"NOTE: Making new data directory at '{data}'. (This is the first run.)")
    makedirs(data)
#else:
#    try:
#        with open(f"{data}/.update", "r") as f:
#            timestamp = f.read()
#    except FileNotFoundError:
#        print("WARN: No .update in the data directory. (Is this the first run?)")
#    else:
#        print(f"NOTE: .update found in the data directory. (The last run was at {timestamp}.)")
NOTE: Making new data directory at 'resources/trackcycle'. (This is the first run.)
There are several ways to query for CMR updates that occurred during a given timeframe. Read on in the CMR Search documentation:
- https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-new-granules (Collections)
- https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-revised-granules (Collections)
- https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-production-date (Granules)
- https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-created-at (Granules)
The created_at parameter works for our purposes. It’s a granule search parameter that returns the records ingested since the input timestamp.
params = {
    'scroll': "true",
    'page_size': 2000,
    'sort_key': "-start_date",
    'collection_concept_id': ccid,
    #'created_at': timestamp,
    # Limit results to granules matching cycle, pass numbers:
    'cycle': trackcycle,
    'passes[0][pass]': trackpass,
}

params
{'scroll': 'true',
'page_size': 2000,
'sort_key': '-start_date',
'collection_concept_id': 'C1968980576-POCLOUD',
'cycle': 25,
'passes[0][pass]': 1}
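If the commented-out created_at filter is re-enabled, the timestamp can be merged into the search parameters without editing the dictionary by hand. This is a minimal sketch: the helper name with_created_at is hypothetical, and the timestamp shown is an illustrative value in the yyyy-mm-ddThh:mm:ssZ format.

```python
def with_created_at(params, timestamp=None):
    """Return a copy of `params`, adding 'created_at' only when a timestamp is given."""
    updated = dict(params)  # copy so the caller's dictionary is left untouched
    if timestamp:
        updated["created_at"] = timestamp
    return updated


base = {
    "collection_concept_id": "C1968980576-POCLOUD",
    "cycle": 25,
    "passes[0][pass]": 1,
}
print(with_created_at(base, "2021-07-13T16:26:44Z"))
```

Leaving the timestamp as None reproduces the search used in this notebook, which matches on cycle and pass only.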
Get the query parameters as a string and then the complete search url:
query = urlencode(params)
url = f"https://{cmr}/search/granules.umm_json?{query}"
print(url)
https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&collection_concept_id=C1968980576-POCLOUD&cycle=25&passes%5B0%5D%5Bpass%5D=1
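The square brackets in passes[0][pass] appear percent-encoded (%5B, %5D) in the printed URL. As a quick sanity check, the query string can be decoded back into its parameters with the standard library. The variable name example_url is only for illustration; it holds the URL printed above.

```python
from urllib.parse import parse_qs, urlsplit

example_url = (
    "https://cmr.earthdata.nasa.gov/search/granules.umm_json"
    "?scroll=true&page_size=2000&sort_key=-start_date"
    "&collection_concept_id=C1968980576-POCLOUD"
    "&cycle=25&passes%5B0%5D%5Bpass%5D=1"
)

# parse_qs decodes percent-escapes and returns each value as a list of strings
decoded = parse_qs(urlsplit(example_url).query)
print(decoded["cycle"])
print(decoded["passes[0][pass]"])
```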
Download the granule records that match our search parameters.
with urlopen(url) as f:
    results = loads(f.read().decode())

print(f"{results['hits']} granules results for '{ccid}' cycle '{trackcycle}' and pass '{trackpass}'.")
1 granules results for 'C1968980576-POCLOUD' cycle '25' and pass '1'.
Neatly print the first granule’s data for reference (assuming at least one was returned).
if len(results['items'])>0:
    #print(dumps(results['items'][0], indent=2)) #print whole record
    print(dumps(results['items'][0]['umm']["RelatedUrls"], indent=2)) #print associated URLs
    # Also, replace timestamp with one corresponding to time of the search.
    #timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
[
{
"URL": "s3://podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.nc",
"Type": "GET DATA VIA DIRECT ACCESS",
"Description": "This link provides direct download access via S3 to the granule."
},
{
"URL": "s3://podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.bufr.bin",
"Type": "GET DATA VIA DIRECT ACCESS",
"Description": "This link provides direct download access via S3 to the granule."
},
{
"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.nc",
"Description": "Download S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.nc",
"Type": "GET DATA"
},
{
"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.xfdumanifest.xml",
"Description": "Download S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.xfdumanifest.xml",
"Type": "EXTENDED METADATA"
},
{
"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.bufr.bin",
"Description": "Download S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.bufr.bin",
"Type": "GET DATA"
},
{
"URL": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
"Description": "api endpoint to retrieve temporary credentials valid for same-region direct s3 access",
"Type": "VIEW RELATED INFORMATION"
},
{
"URL": "https://opendap.earthdata.nasa.gov/collections/C1968980576-POCLOUD/granules/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02",
"Type": "USE SERVICE API",
"Subtype": "OPENDAP DATA",
"Description": "OPeNDAP request URL"
},
{
"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.data_01.ku.ssha.png",
"Type": "GET RELATED VISUALIZATION",
"Subtype": "DIRECT DOWNLOAD",
"MimeType": "image/png"
},
{
"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.data_01.ku.swh_ocean.png",
"Type": "GET RELATED VISUALIZATION",
"Subtype": "DIRECT DOWNLOAD",
"MimeType": "image/png"
},
{
"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.data_01.ku.sea_state_bias.png",
"Type": "GET RELATED VISUALIZATION",
"Subtype": "DIRECT DOWNLOAD",
"MimeType": "image/png"
},
{
"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.data_01.ku.atm_cor_sig0.png",
"Type": "GET RELATED VISUALIZATION",
"Subtype": "DIRECT DOWNLOAD",
"MimeType": "image/png"
},
{
"URL": "https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.data_01.ku.sig0_ocean.png",
"Type": "GET RELATED VISUALIZATION",
"Subtype": "DIRECT DOWNLOAD",
"MimeType": "image/png"
}
]
The link for HTTP access is denoted by "Type": "GET DATA" in the list of RelatedUrls.
Grab the download URL, but do it in a way that’ll work for search results returning any number of granule records:
downloads = []

# loop over every granule record so the cell works for any number of results
for item in results['items']:
    for l in item['umm']["RelatedUrls"]:
        #if the link starts with the following, it is the download link we want
        if l['URL'].startswith('https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/'):
            #we want the .nc file
            if l['URL'].endswith('.nc'):
                downloads.append(l['URL'])
downloads
['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.nc']
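An alternative is to select links by their "Type" field rather than by URL prefix. The sketch below assumes that every HTTPS download link carries "Type": "GET DATA", as in the record printed earlier; the helper name data_urls is hypothetical, and sample_items is a trimmed copy of that record used only for illustration.

```python
def data_urls(items, suffix=".nc"):
    """Collect HTTPS 'GET DATA' URLs ending in `suffix` from UMM-G granule records."""
    urls = []
    for item in items:
        for link in item.get("umm", {}).get("RelatedUrls", []):
            url = link.get("URL", "")
            if (link.get("Type") == "GET DATA"
                    and url.startswith("https://")
                    and url.endswith(suffix)):
                urls.append(url)
    return urls


# Trimmed sample of the record shown above
name = "S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02"
short = "JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F"
protected = f"https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/{short}"
sample_items = [{"umm": {"RelatedUrls": [
    {"URL": f"s3://podaac-ops-cumulus-protected/{short}/{name}.nc",
     "Type": "GET DATA VIA DIRECT ACCESS"},
    {"URL": f"{protected}/{name}.nc", "Type": "GET DATA"},
    {"URL": f"{protected}/{name}.bufr.bin", "Type": "GET DATA"},
    {"URL": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
     "Type": "VIEW RELATED INFORMATION"},
]}}]

print(data_urls(sample_items))
```

Filtering on the type and the file suffix avoids hard-coding the archive bucket path, at the cost of relying on the metadata being filled in consistently.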
Finish by downloading the files to the data directory in a loop. Overwrite .update with a new timestamp on success.
for f in downloads:
    try:
        # earthaccess.download takes a list of URLs and a target directory
        earthaccess.download([f], data)
    except Exception as e:
        print(f"[{datetime.now()}] FAILURE: {f}\n\n{e}\n")
        raise e
    else:
        print(f"[{datetime.now()}] SUCCESS: {f}")
[2023-08-02 14:11:23.926892] SUCCESS: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F/S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.nc
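As a sanity check, the cycle and pass can be read back from a granule file name and compared with the values used in the search. This sketch assumes that the two three-digit fields just before the start time are the cycle and pass numbers, which is inferred only from the example file above (cycle 25, pass 1, file name containing _025_001_). The helper name cycle_pass_from_name is hypothetical.

```python
import re
from os.path import basename

# Two 3-digit fields followed by the start and end times (assumed layout)
_CYCLE_PASS = re.compile(r"_(\d{3})_(\d{3})_\d{8}T\d{6}_\d{8}T\d{6}_")


def cycle_pass_from_name(path_or_url):
    """Return (cycle, pass) parsed from a granule file name, or None if no match."""
    match = _CYCLE_PASS.search(basename(path_or_url))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


example = "S6A_P4_2__LR_RED__NR_025_001_20210713T162644_20210713T182234_F02.nc"
print(cycle_pass_from_name(example))
```

Comparing the returned tuple with (trackcycle, trackpass) flags any file that does not belong to the requested track.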
If there were updates to the local time series during this run and no exceptions were raised during the download loop, then overwrite the timestamp file that tracks updates to the data folder (resources/trackcycle/.update):
#if len(results['items'])>0:
#    with open(f"{data}/.update", "w") as f:
#        f.write(timestamp)