SMCEFR: Sentinel-3 Satellite Dataset

By Cade Brown


An open science dataset for machine learning and analysis, sourced from the Sentinel-3 mission data.

I was asked to sponsor a dataset for ORNL’s Smoky Mountain Conference, in addition to posing a number of challenge questions. Then, researchers of all levels (undergrad to post-grad and beyond) were invited to investigate the dataset and write papers about their findings.

To do that, I chose to collect and process data from the Sentinel-3 satellite missions, which provide what is essentially a bird’s eye view of the Earth from space. The data is freely available, and I wrote a script to download it and process it into a format that is easy to work with. You can see the source code on GitHub. You can also download the dataset at the releases page.

I am also a reviewer and judge for the papers, so I’m excited to see what people will come up with!

Introduction

Figure 1: A sample of 18 images from the smcefr-mini dataset

Satellite data is important for many environmental sciences, as it gives scientists a bird’s eye view of large areas of the Earth. Modern orbital sensors allow them to collect additional research input remotely and with high precision in order to employ data-intensive analysis workflows. For example, the Sentinel-3 satellite collects images that scientists use for a wide variety of research tasks, such as monitoring ocean and land surface temperatures, bootstrapping models for weather forecasting or the prediction of atmospheric conditions, and modeling and predicting climate change.

However, in its original form, the data is unwieldy for rapid exploration because of its size and format. Our goal was to remedy this by simplifying the dataset. To this end, it has been preprocessed into a more approachable format, which is described in Data Source, but here is a brief overview:

SMCEFR: Sentinel-3 Satellite is a dataset consisting of 1024x1024 red-green-blue (RGB) images generated from the Sentinel-3 satellite via the Ocean and Land Color Instrument (OLCI). To facilitate the competition challenge, we simplified the original data and reduced its volume. This preprocessing created a subset of the data and produced RGB images that are easier to analyze with model prototypes, using tools such as Python with the NumPy and Pillow modules, TensorFlow, OpenCV in C++, and many other modern image processing packages. Our goal was to give the participants the ability to rapidly test and visualize their algorithms. Furthermore, we posed challenge questions to suggest research directions that the participants could explore to present their own insights. In particular, we encourage approaches originating from computer vision, numerical programming, and machine learning.

Data Source

The original data for this dataset came from the Copernicus Open Access Hub, which hosts large-scale historical records allowing access to multi-sensor data. However, the hub’s access policy requires a complicated workflow to query, filter, download, and preprocess the available datasets. We recognize that full applications need access to the extra geospatial and spectral information. While that information could be extremely useful, for the purposes of a data challenge competition with a broad appeal, a smaller and more manageable dataset is desirable. In the following, we summarize the specifications that guided the creation of this dataset, which we will call SMCEFR (SMC’s Earth Full Resolution):

In the end, we produced smaller-scale datasets in specific sizes, encoded as conventional RGB images (PNG format) at 1024x1024 pixels. Figure 1 presents a sample of 18 images from our reduced dataset. These are available from the GitHub release page, and are meant to be easily accessible using any of the following software packages or workflows:

Filename Schema

Figure 2: Sentinel-3 Generic Filename Schema Image

For clear identification, the PNG images in the dataset are named following a strict scheme, and the details of the format are described in Figure 2. Although this information is auxiliary for most intended processing scenarios, it could serve as additional input and guide supervision during training.
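As a rough sketch of how a name could be turned into metadata, the snippet below parses only the leading fields of the public Sentinel-3 product naming convention: mission, data source, processing level, data type, and three timestamps. Whether the PNG names in this dataset keep exactly that prefix is an assumption to check against Figure 2. The example name is made up for illustration, and `PATTERN` and `parse_name` are hypothetical helpers, not part of the dataset tooling.

```python
import re
from datetime import datetime

# Leading fields of the generic Sentinel-3 product name.
# Assumption: the dataset's PNG files keep this prefix unchanged.
PATTERN = re.compile(
    r"^(?P<mission>S3[AB_])_"          # e.g. S3A or S3B
    r"(?P<source>[A-Z0-9_]{2})_"       # instrument, e.g. OL for OLCI
    r"(?P<level>[0-9_])_"              # processing level
    r"(?P<data_type>[A-Z0-9_]{6})_"    # e.g. EFR___ (padded with underscores)
    r"(?P<start>\d{8}T\d{6})_"         # sensing start time
    r"(?P<stop>\d{8}T\d{6})_"          # sensing stop time
    r"(?P<created>\d{8}T\d{6})_"       # product creation time
)


def parse_name(filename):
    """Return a dict of the leading name fields, or None if the name does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("start", "stop", "created"):
        fields[key] = datetime.strptime(fields[key], "%Y%m%dT%H%M%S")
    return fields


if __name__ == "__main__":
    # Hypothetical file name, used only to exercise the parser.
    example = (
        "S3A_OL_1_EFR____20200101T101500_20200101T101800_"
        "20200102T120000_0179_053_222_2160_LN1_O_NT_002.png"
    )
    print(parse_name(example))
```

Fields such as the sensing start time could then be used to group images by season or time of day before training.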

Accessing Dataset

The download is compact: it is a single tar file compressed with gzip (smcefr-mini.tar.gz). It can be expanded with the command:

$ tar -xvf smcefr-mini.tar.gz

This creates a directory that contains all the PNG files we selected. These can then be easily read with the OpenCV library, Python’s Pillow, TensorFlow, and many other media frameworks.
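As a minimal sketch of the Pillow route, the snippet below loads the extracted PNG files into NumPy arrays. The directory name `smcefr-mini` is an assumption about what the archive unpacks to, and `load_images` is a hypothetical helper written for this example.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def load_images(directory, limit=None):
    """Load PNG files from `directory` as a list of (file name, uint8 RGB array) pairs."""
    paths = sorted(Path(directory).glob("*.png"))
    if limit is not None:
        paths = paths[:limit]
    images = []
    for path in paths:
        with Image.open(path) as img:
            # Force 3-channel RGB so every array has shape (height, width, 3).
            images.append((path.name, np.asarray(img.convert("RGB"))))
    return images


if __name__ == "__main__":
    # Assumed directory name; use whatever directory tar actually created.
    for name, pixels in load_images("smcefr-mini", limit=3):
        print(name, pixels.shape, pixels.dtype)
```

Loading only a few files at first (via `limit`) keeps memory use low, since each 1024x1024 RGB image takes about 3 MB once decoded.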

Challenge Questions

To motivate potential analysis methods for the dataset, we present below sample challenge questions and directions that explore ideas in data science, computer vision, and machine learning.

Q1: Cloud identification

What methods can be used to segment and identify the regions of the images that are partially or completely obstructed by cloud cover?

Potential approaches:
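One very simple baseline, offered here only as an illustrative starting point, is to flag pixels that are both bright and nearly colorless, since clouds tend to look white or light gray in RGB. The thresholds below are untuned assumptions, and `cloud_mask` is a hypothetical helper. Bright surfaces such as snow, ice, or sun glint would likely be flagged as well, which is part of what makes the question interesting.

```python
import numpy as np


def cloud_mask(rgb, min_brightness=0.7, max_saturation=0.2):
    """Return a boolean (height, width) mask of bright, low-saturation pixels.

    Both thresholds are illustrative guesses, not values tuned on the dataset.
    """
    scaled = np.asarray(rgb, dtype=np.float64) / 255.0
    brightest = scaled.max(axis=-1)   # HSV "value" channel
    darkest = scaled.min(axis=-1)
    # HSV-style saturation; the floor avoids dividing by zero on black pixels.
    saturation = (brightest - darkest) / np.maximum(brightest, 1e-6)
    return (brightest >= min_brightness) & (saturation <= max_saturation)


if __name__ == "__main__":
    # Tiny synthetic image: white, deep blue, dark gray, light gray.
    demo = np.array(
        [[[255, 255, 255], [0, 0, 200]],
         [[40, 40, 40], [220, 220, 225]]],
        dtype=np.uint8,
    )
    mask = cloud_mask(demo)
    print(mask)
    print("cloud fraction:", mask.mean())
```

A mask like this could serve as a weak label for training a learned segmentation model, or as a sanity check against one.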

Q2: Noise Removal

Consider a case where the sensor data is incomplete, of varied quality, or totally degraded. Is there any way to take partially damaged/noisy data, and reconstruct something “closer” to the original sensor reading?

Be sure to consider the measure of “closeness” carefully. For example, can standard metrics such as PSNR (peak signal-to-noise ratio) or MSE (mean squared error) be used to compare the performance of various methods?

For this task, we suggest creating copies of the “ground truth” dataset, and then adding a random amount of noise, followed by processing the copy as input.
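A minimal sketch of that setup follows, assuming additive Gaussian noise and 8-bit images. The noise model, the `sigma` value, and the helper names are assumptions made for illustration; other kinds of degradation (missing rows, salt-and-pepper noise, blur) are equally valid choices. For 8-bit data, PSNR is computed as 10 * log10(255^2 / MSE).

```python
import numpy as np


def add_gaussian_noise(image, sigma, rng):
    """Return a uint8 copy of `image` with zero-mean Gaussian noise of std `sigma`."""
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


def mse(a, b):
    """Mean squared error between two images of the same shape."""
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.mean(diff ** 2))


def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(a, b)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / err))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for a "ground truth" image; replace with a real PNG from the dataset.
    clean = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    noisy = add_gaussian_noise(clean, sigma=10.0, rng=rng)
    print("MSE:", mse(clean, noisy))
    print("PSNR (dB):", psnr(clean, noisy))
```

A denoising method can then be scored by comparing its output against `clean` with the same two functions: a higher PSNR than the noisy input indicates that the method recovered something closer to the original.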

Q3: Image Compression

To preserve data integrity, the dataset is provided as PNG, a lossless compression format. However, for a variety of purposes, it would be useful to allow a small amount of error in exchange for an appreciable size reduction. What methods can be used to save space while keeping the best quality possible?
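As one possible baseline to compare against, the sketch below re-encodes an image as JPEG at several quality settings using Pillow and reports the encoded size alongside the reconstruction error. JPEG is used here only as a familiar lossy codec, not as a recommended answer, and the synthetic test image and helper names are assumptions for illustration.

```python
import io

import numpy as np
from PIL import Image


def png_size(pixels):
    """Size in bytes of `pixels` (uint8 RGB array) encoded as PNG in memory."""
    buffer = io.BytesIO()
    Image.fromarray(pixels).save(buffer, format="PNG")
    return len(buffer.getvalue())


def jpeg_roundtrip(pixels, quality):
    """Encode `pixels` as JPEG in memory; return (size in bytes, decoded uint8 array)."""
    buffer = io.BytesIO()
    Image.fromarray(pixels).save(buffer, format="JPEG", quality=quality)
    size = len(buffer.getvalue())
    buffer.seek(0)
    decoded = np.asarray(Image.open(buffer).convert("RGB"))
    return size, decoded


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in image; replace with a real PNG from the dataset for meaningful numbers.
    pixels = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    print("PNG bytes:", png_size(pixels))
    for quality in (30, 60, 90):
        size, decoded = jpeg_roundtrip(pixels, quality)
        error = np.mean((pixels.astype(np.float64) - decoded) ** 2)
        print(f"JPEG quality={quality}: {size} bytes, MSE={error:.1f}")
```

Plotting size against error for each candidate method gives a rate-distortion curve, which makes it easy to see whether a new approach beats the baseline at a given file size.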