From the PO.DAAC Cookbook, to access the GitHub version of the notebook, follow this link.

Table of Contents

Access Sentinel-6 NRT Data

This notebook shows a simple way to maintain a local time series of Sentinel-6 NRT data using the CMR Search API. It downloads the granules ingested since the previous run to a designated data folder and, on success, overwrites a hidden file inside that folder with the timestamp of the CMR Search request.

Before you start

Before you begin this tutorial, make sure you have an Earthdata account: https://urs.earthdata.nasa.gov for the operations environment (most common) or https://uat.urs.earthdata.nasa.gov for the UAT environment.

Accounts are free to create and take just a moment to set up.

import requests
from os import makedirs
from os.path import isdir, basename
from urllib.parse import urlencode
from urllib.request import urlopen, urlretrieve
from datetime import datetime, timedelta
from json import dumps, loads
import earthaccess
from earthaccess import Auth, DataCollections, DataGranules, Store

Authentication with earthaccess

In this notebook, we authenticate with Earthdata Login by calling earthaccess in the cell below.

auth = earthaccess.login(strategy="interactive", persist=True)
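
For unattended runs, earthaccess can also authenticate without a prompt. A minimal sketch, assuming you have either a ~/.netrc entry for urs.earthdata.nasa.gov or the EARTHDATA_USERNAME / EARTHDATA_PASSWORD environment variables set:

import earthaccess

# Read credentials from ~/.netrc (entry for urs.earthdata.nasa.gov) ...
auth = earthaccess.login(strategy="netrc")

# ... or from the EARTHDATA_USERNAME / EARTHDATA_PASSWORD environment variables
# auth = earthaccess.login(strategy="environment")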

Hands-off workflow

This workflow/notebook can be run routinely to maintain a time series of NRT data, downloading new granules as they become available in CMR.

The notebook writes/overwrites a file named .update in the target data directory with each successful run. The file tracks the date and time of the most recent update to the time series of NRT granules using a timestamp in the format yyyy-mm-ddThh:mm:ssZ.

The timestamp matches the value used for the created_at parameter in the last successful run. This parameter finds the granules created within a range of datetimes. This workflow leverages the created_at parameter to search backwards in time for new granules ingested between the time of our timestamp and now.
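
For reference, a single created_at value matches everything ingested from that time forward, while a comma-separated pair bounds the search to an explicit window (per the CMR Search documentation linked further down). A minimal sketch of both forms, with a hypothetical end time added only to illustrate the range syntax:

from datetime import datetime, timedelta

# Build a "since" timestamp the same way the workflow does below
since = (datetime.utcnow() - timedelta(minutes=20)).strftime("%Y-%m-%dT%H:%M:%SZ")

# A single value matches granules created from that time onward ...
params_open_ended = {"created_at": since}

# ... while "start,end" limits the search to an explicit window
end = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
params_bounded = {"created_at": f"{since},{end}"}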

The variables in the cell below determine the workflow behavior on its initial run:

  • mins: Initialize a new local time series by starting with the granules ingested since ___ minutes ago.
  • cmr: The domain of the target CMR instance, either cmr.earthdata.nasa.gov (operations) or cmr.uat.earthdata.nasa.gov (UAT).
  • ccid: The unique CMR concept-id of the desired collection.
  • data: The path to a local directory in which to download/maintain a copy of the NRT granule time series.
cmr = "cmr.earthdata.nasa.gov"

# this function returns a concept id for a particular dataset
def get_collection(url: str=f"https://{cmr}/search/collections.umm_json", **params):
    return requests.get(url, params).json().get("items")[0]

#
# This cell accepts parameters from command line with papermill: 
#  https://papermill.readthedocs.io
#
# These variables should be set before the first run, then they 
#  should be left alone. All subsequent runs expect the values 
#  for cmr, ccid, data to be unchanged. The mins value has no 
#  impact on subsequent runs.
#

mins = 20

name = "JASON_CS_S6A_L2_ALT_LR_RED_OST_NRT_F"

ccid = get_collection(ShortName=name).get("meta").get("concept-id")

data = "resources/nrt"

The variable data points at a nearby folder, resources/nrt, by default. You should change data to a suitable download path on your file system; an unlucky sequence of git commands could wipe out that folder and its downloads if you're not careful.
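
For example, a minimal sketch of pointing data at a folder outside the repository (the exact location is up to you and purely hypothetical here):

from pathlib import Path

# Keep downloads out of the git working tree (hypothetical location)
data = str(Path.home() / "sentinel6_nrt")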

The search retrieves granules ingested during the last n minutes. If a .update file already exists in your local data directory, its timestamp is used as the start of the search window instead; otherwise the CMR Search falls back on the window set by mins above.

timestamp = (datetime.utcnow()-timedelta(minutes=mins)).strftime("%Y-%m-%dT%H:%M:%SZ")
timestamp
'2023-08-02T21:08:21Z'

This cell will replace the timestamp above with the one read from the .update file in the data directory, if it exists.

if not isdir(data):
    print(f"NOTE: Making new data directory at '{data}'. (This is the first run.)")
    makedirs(data)
else:
    try:
        with open(f"{data}/.update", "r") as f:
            timestamp = f.read()
    except FileNotFoundError:
        print("WARN: No .update in the data directory. (Is this the first run?)")
    else:
        print(f"NOTE: .update found in the data directory. (The last run was at {timestamp}.)")
WARN: No .update in the data directory. (Is this the first run?)

There are several ways to query for CMR updates that occurred during a given timeframe. See the CMR Search documentation:

  • https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-new-granules (Collections)
  • https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-revised-granules (Collections)
  • https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-production-date (Granules)
  • https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-created-at (Granules)

The created_at parameter works for our purposes. It’s a granule search parameter that returns the records ingested since the input timestamp.

params = {
    'scroll': "true",
    'page_size': 2000,
    'sort_key': "-start_date",
    'collection_concept_id': ccid, 
    'created_at': timestamp,
    # Limit results to coverage for .5deg bbox in Gulf of Alaska:
    'bounding_box': "-146.5,57.5,-146,58",
}

params
{'scroll': 'true',
 'page_size': 2000,
 'sort_key': '-start_date',
 'collection_concept_id': 'C1968980576-POCLOUD',
 'created_at': '2023-08-02T21:08:21Z',
 'bounding_box': '-146.5,57.5,-146,58'}

Get the query parameters as a string and then the complete search url:

query = urlencode(params)
url = f"https://{cmr}/search/granules.umm_json?{query}"
print(url)
https://cmr.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&collection_concept_id=C1968980576-POCLOUD&created_at=2023-08-02T21%3A08%3A21Z&bounding_box=-146.5%2C57.5%2C-146%2C58

Download the records in umm_json format for granules that match our search parameters, then get a new timestamp that represents the UTC time of the search:

with urlopen(url) as f:
    results = loads(f.read().decode())

print(f"{results['hits']} new granules ingested for '{ccid}' since '{timestamp}'.")

timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
0 new granules ingested for 'C1968980576-POCLOUD' since '2023-08-02T21:08:21Z'.

Neatly print the first granule record (if one was returned):

if len(results['items'])>0:
    print(dumps(results['items'][0], indent=2))

The link for HTTP access can be retrieved from each granule record's RelatedUrls field. The download link is identified by "Type": "GET DATA".

Select the download URL for each of the granule records:

downloads = [[u['URL'] for u in r['umm']['RelatedUrls'] if u['Type']=="GET DATA"][0] for r in results['items']]
downloads
[]

Finish by downloading the files to the data directory in a loop. Overwrite .update with a new timestamp on success.

for f in downloads:
    try:
        # f is the granule's download URL; save the file into the data directory
        earthaccess.download([f], data)
    except Exception as e:
        print(f"[{datetime.now()}] FAILURE: {f}\n\n{e}\n")
        raise e
    else:
        print(f"[{datetime.now()}] SUCCESS: {f}")

If there were updates to the local time series during this run and no exceptions were raised during the download loop, then overwrite the timestamp file that tracks updates to the data folder (resources/nrt/.update):

if len(results['items'])>0:
    with open(f"{data}/.update", "w") as f:
        f.write(timestamp)
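
As the parameters cell notes, the workflow accepts parameters from papermill, so it can be scheduled to run unattended. A minimal sketch, assuming the notebook is saved as Sentinel6_NRT.ipynb (a hypothetical filename) and papermill is installed:

import papermill as pm

# Execute the notebook with overridden parameters; run this from cron or another scheduler.
pm.execute_notebook(
    "Sentinel6_NRT.ipynb",          # hypothetical input notebook
    "Sentinel6_NRT-latest.ipynb",   # executed copy written by papermill
    parameters={"mins": 60, "data": "resources/nrt"},
)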