Welcome to MRSDS!
The Methane Research Science Data System (MRSDS) organization hosts repositories related to data processing, analysis, and visualization of atmospheric methane information across the full range of spatial scales, including local, regional, and global. The collection of software associated with this functionality is referred to as the Multi-Scale Methane Analytic Framework (M2AF). Please refer to Jacob et al. (2021) for additional background on the M2AF software deployed across high-performance computing and cloud computing platforms. The primary visualization component for this work is the Methane Source Finder (MSF) portal, and several of the code repositories relate to the MSF front-end, back-end, and data ingestion components.
The following describes the function of each repository in the MRSDS
organization:
- mrsds.github.io: This documentation.
- msf-flow: Local plume processing pipeline, including the data harvester, workflow deployment on AWS, wind processor, and plume post-processing and filtering.
- msf-ui: The front-end map client code, based on the Common Mapping Client.
- msf-ui-design: Figma files with the screen designs; open them at http://figma.com.
- msf-portal-docker: Used to build Docker containers of the components needed to run the Methane Source Finder web application: msf-ui, msf-be, and msf-static-layers.
- msf-ingestion: Ingests point-source data and GeoTIFF plume images into the methane portal database.
- msf-static-layers: Gridded base layers of data; Dockerized.
- msf-be: Back end of the methane portal. Includes setting up the point-source database and providing all the server APIs used by msf-ui to retrieve data.
- MethaneSourceFinder-BackEndDocker:
- MethaneSourceFinder-FrontEndDocker:
- MethaneSourceFinder-StaticContentDocker:
- sdap_collections: Example SDAP dataset collection configuration.
- sdap_notebooks: Example SDAP analytics in a Jupyter notebook.
To run the Methane Source Finder (MSF) web application you will need msf-ui (the web app), msf-be (the back end that serves the APIs providing data to the UI), and msf-static-layers (gridded data layers stored as flat files), plus a database with the point-source data and GeoTIFF plume images (loaded by msf-ingestion).
To run the point-source data pipeline you will need msf-flow.
SDAP data ingestion configurations and test Jupyter notebooks are in the sdap_collections and sdap_notebooks repos.

Science Data Analytics Platform (SDAP)
We have demonstrated the use of the open-source
Science Data Analytics Platform (SDAP)
to perform basic analytics on our gridded methane data products,
such as our regional and global inversions. SDAP uses Apache Spark
for parallel computations in the "map-reduce" style. In
map-reduce computations, one or more "map" functions operate
independently on different subsets of the data. Reduction
operators combine the distributed map results to produce the
final analytics product. The final product is expected to be
smaller than the collection of input data files that were used
to compute the result. In this way, SDAP performs the
computations remotely, close to the data, and eliminates the
need for large data file downloads.
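To make the pattern concrete, here is a minimal map-reduce illustration in plain Python (not the actual SDAP/Spark code): a spatial mean per timestamp is computed by mapping a partial sum and count over independent data chunks and then reducing the partial results.

    # Plain-Python illustration of the map-reduce pattern described above
    # (not the actual SDAP/Spark implementation).
    from functools import reduce
    import numpy as np

    def map_chunk(chunk):
        """Map step: partial sum and valid-cell count per timestamp for one spatial subset."""
        # chunk has shape (time, lat, lon); NaNs would mark cells without valid data.
        return np.nansum(chunk, axis=(1, 2)), np.sum(~np.isnan(chunk), axis=(1, 2))

    def combine(a, b):
        """Reduce step: merge two (sum, count) partial results."""
        return a[0] + b[0], a[1] + b[1]

    # Four chunks standing in for distributed subsets of a gridded methane field.
    rng = np.random.default_rng(0)
    chunks = [rng.random((12, 50, 50)) for _ in range(4)]

    total, count = reduce(combine, map(map_chunk, chunks))
    area_mean = total / count      # final product: one value per timestamp
    print(area_mean.shape)         # (12,) -- much smaller than the input data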
SDAP Web Service API
Analytics requests to SDAP are made through a web service API that can be called from a variety of programming languages or from any web browser. Please refer to the
Jupyter Notebooks
for examples of how to make calls to SDAP in Python.
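As a minimal sketch, the request below shows an area-averaged time series query issued with the Python requests library. The host and dataset name are placeholders, and the endpoint and parameter names follow our reading of the upstream SDAP (NEXUS) web service API; verify them against your own deployment and the notebooks above.

    # Minimal sketch of an SDAP web service call from Python. The host and
    # dataset id are placeholders; verify endpoint/parameter names against
    # your own SDAP deployment.
    import requests

    SDAP_HOST = "https://sdap.example.org"      # placeholder deployment URL
    DATASET = "methane-flux-global"             # placeholder dataset (collection) id

    params = {
        "ds": DATASET,                          # dataset to analyze
        "b": "-125,25,-100,45",                 # bounding box: minLon,minLat,maxLon,maxLat
        "startTime": "2020-01-01T00:00:00Z",    # ISO 8601 time range
        "endTime": "2020-12-31T23:59:59Z",
    }

    # Area-averaged time series request; the same pattern applies to the
    # other analytics endpoints.
    resp = requests.get(f"{SDAP_HOST}/timeSeriesSpark", params=params, timeout=300)
    resp.raise_for_status()
    print(resp.json())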
SDAP Datastore
SDAP delivers rapid subsetting by using a tile-based datastore instead of operating on many files. The data array for each variable of interest is partitioned into equal-sized tiles, each covering a particular time range and coordinate bounds. These tiles are ingested into the SDAP datastore, which has two components:
- Solr: Hosts tile attributes, including a unique identifier for each tile, spatial bounds, time range covered, and summary statistics; enables rapid geospatial search for tiles that intersect a user-defined bounding box.
- Cassandra: Hosts the actual data tiles and enables each tile to be retrieved directly using the unique identifier obtained from Solr.
The SDAP algorithms rapidly access the necessary data subsets in a two-step process. First, SDAP performs a geospatial search in Solr to find the unique identifiers of the data tiles that intersect the time range and spatial area of interest. It then uses those tile identifiers as keys to retrieve the tile data directly from Cassandra's key-value store.
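A rough sketch of that two-step lookup, using the pysolr and cassandra-driver client libraries, is shown below. The Solr core, query fields, Cassandra keyspace, and table/column names are illustrative assumptions only and may not match the actual SDAP schema.

    # Illustrative two-step tile lookup: Solr search for tile ids, then direct
    # key lookups in Cassandra. Core, field, keyspace, and table names are
    # placeholders, not the actual SDAP schema.
    import pysolr
    from cassandra.cluster import Cluster

    # Step 1: search Solr for tiles of a dataset within a time range
    # (the query and filter fields are illustrative).
    solr = pysolr.Solr("http://localhost:8983/solr/nexustiles")
    docs = solr.search(
        "dataset_s:methane-flux-global",
        fq=["tile_min_time_dt:[2020-01-01T00:00:00Z TO 2020-12-31T23:59:59Z]"],
        fl="id",
        rows=1000,
    )
    tile_ids = [doc["id"] for doc in docs]

    # Step 2: use the tile identifiers as keys to fetch tile data from Cassandra.
    session = Cluster(["localhost"]).connect("nexustiles")   # assumed keyspace
    lookup = session.prepare("SELECT tile_blob FROM tiles WHERE tile_id = ?")
    tiles = [session.execute(lookup, (tile_id,)).one() for tile_id in tile_ids]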
SDAP Analytics Algorithms
The SDAP analytics algorithms that may be highly relevant to multi-scale methane analysis are:
- Area-averaged time series: Compute a spatial mean for each timestamp.
- Time averaged map: Compute a time average for each spatial grid cell.
- Correlation map: Compute the correlation coefficient for two identically sampled/gridded datasets.
- Daily difference average: Compute a spatial mean of the difference between a dataset and its climatology (long-term average) for each timestamp.
All of the SDAP computations can be constrained to a spatiotemporal
bounding box.
SDAP Deployment
SDAP is deployed using Kubernetes and Helm. A functional SDAP consists
of the following components:
- nexus-webapp-driver: 1 pod that listens for incoming web service calls (SDAP job requests) and routes them to the appropriate handler.
- nexus-analysis-exec: 1 or more Apache Spark executor pods that perform a share of the work needed to fulfill an SDAP job request.
- sdap-solr: 1 or more replicas of Solr, the component of the SDAP datastore that enables geospatial search for data tiles that intersect a spatio-temporal bounding box.
- sdap-zookeeper: 1 or more replicas of ZooKeeper, a sidecar app that handles the synchronization and configuration required by Solr.
- sdap-cassandra: 1 or more replicas of Cassandra, the component of the SDAP datastore that enables key lookup and retrieval of data tile values.
- collection-manager: 1 pod that keeps track of the datasets that have been ingested into an SDAP deployment. It watches a folder on AWS S3 or a directory on the local file system for new files to ingest for each dataset. When it finds a new file, it schedules an ingest job for that file on a queue in RabbitMQ that is watched by the Granule Ingesters (see below).
- rabbitmq: Deployment of RabbitMQ with a queue set up for ingest jobs, each consisting of a file to be ingested. This queue is populated by the Collection Manager component, and jobs are popped off the queue by the Granule Ingester(s).
- granule-ingester: 1 or more ingest workers that pop ingest jobs off of the queue in RabbitMQ, slice the file to be ingested into tiles, and ingest the tiles into the Solr/Cassandra datastore.
SDAP Data Ingest
To run analytics on a dataset with SDAP you first need to "ingest" the dataset. Please refer to the
SDAP helm deployment instructions
to learn how to deploy SDAP and ingest data into it. SDAP supports ingest from NetCDF4 or HDF5 files that follow the
CF Metadata Conventions.
As described in the SDAP documentation, you will need to compose a YAML dataset configuration for each dataset you ingest. Please refer to the examples we provided at
Example collections.yaml.
Additional Recommendations for Using SDAP
The following are some additional recommendations that we have found increase the likelihood that the SDAP ingesters will support your data files; a basic automated check covering these points is sketched after the list:
- In your NetCDF files, you will provide latitude and longitude dimensions and also variables that provide values for those dimensions. You should ensure that the dimension name matches the corresponding variable name. For example, the header should have something like:

      dimensions:
          lat = 100 ;
      variables:
          float lat(lat) ;

  and do not do this:

      dimensions:
          lat = 100 ;
      variables:
          float latitude(lat) ;
- In your NetCDF files you will provide a time dimension and variable. You should ensure your time variable has units of a form accepted by the CF Conventions. For example, this is an appropriate time unit:

      time:units = "seconds since 1970-01-01T00:00:00Z"
- The variable being ingested should be arranged so that the dimensions are indexed in the following order:
  - time
  - latitude
  - longitude

  For example, the following should work fine with SDAP:

      float fluxes(time, lat, lon) ;
- The variable being ingested can have fill values for grid cells without valid values. You should identify the fill value using the standard variable attribute, _FillValue.
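These recommendations can be checked automatically before ingest. The sketch below uses the netCDF4-python library; the file path, variable name, and expected dimension names are placeholders to adapt to your own dataset.

    # Quick pre-ingest sanity check for the recommendations above, using the
    # netCDF4-python library. File path, variable name, and dimension names
    # are placeholders.
    from netCDF4 import Dataset

    PATH = "methane_fluxes.nc"   # placeholder file
    VAR = "fluxes"               # placeholder variable to be ingested

    with Dataset(PATH) as nc:
        # 1. Each coordinate dimension should have a variable of the same name.
        for dim in ("time", "lat", "lon"):
            assert dim in nc.dimensions, f"missing dimension: {dim}"
            assert dim in nc.variables, f"dimension '{dim}' has no matching variable"

        # 2. The time units should follow the CF "<units> since <reference>" form.
        time_units = getattr(nc.variables["time"], "units", "")
        assert " since " in time_units, f"non-CF time units: {time_units!r}"

        # 3. The ingested variable should be dimensioned (time, lat, lon), in that order.
        var = nc.variables[VAR]
        assert var.dimensions == ("time", "lat", "lon"), f"unexpected order: {var.dimensions}"

        # 4. Missing data should be flagged with the standard _FillValue attribute.
        assert "_FillValue" in var.ncattrs(), "no _FillValue attribute on the variable"

    print("basic pre-ingest checks passed")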