Using stars for remote big Earth Observation data processing

Edzer Pebesma

4 years ago

[This article was first published on r-spatial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

[view raw Rmd]

Summary

This is the third blog on the stars project, an R-Consortium funded project for spatiotemporal tidy arrays with R. It shows how stars can be used in a client/server setup, where large amounts of Earth observation data are held on a server, and an (R) client is used to query the data, process it on the server side, and retrieve a (small) portion, for example to plot.

This blog is a proof of concept, and was developed for the use case of remotely processing a set of Sentinel 2 images; no serious effort has been made so far to generalize this to other image sources. We use the stars package both on server and client side.

The data

Our dataset consists of 36 zip files, containing Sentinel 2 scenes (or images, tiles, granules). Together they are 20 Gb. GDAL has a driver for such zip files, and hence stars can read them directly. Each zip file contains over 100 files, of which 15 are JPEG2000 files with imagery having 10, 20 or 60 m resolution. We focus on the 10 m bands (Red, Green, Blue, Near Infrared); single images of these have 10980 x 10980 pixels, but also come with overview (downsampled) levels fo 20, 40, 80 and 160 m. That is convenient, because for plotting it does not make much sense to transport more pixels than our screen has.

The data were downloaded from ESA’s SciHub, which first looks cumbersome but in the end it was quite easy using aria2c. (The aria2 file contains my user credentials.)

Server

The server side is set up using plumber:

library(plumber)
r <- plumb("server.R")
r$run(port=8000)

The server.R script is now explained in sections.

Reading data

The following script reads all the files ending in .zip, and creates names from them that are understood by the GDAL Sentinel 2 driver. The /vsizip prefix makes sure that the zip files is read as if it were unzipped. Function read_stars_meta reads as much as possible metadata from the set of files (and their subdatasets: here only subdataset 1, which has the 10 m x 10 m imagery), and returns it in a (still) somewhat chaotic nested list. The object md is an sf-tibble with names and bounding boxes (as polygon feature geometries) of every image.

library(jsonlite) # base64_enc
library(stars)
library(tibble)

# load some imagery data
# from a S2 .zip file, create a readable gdal source:
s2_expand_zip = function(s) {
    paste0("/vsizip/", s, "/", sub(".zip", ".SAFE", s), "/MTD_MSIL1C.xml")
}
lst = list.files(pattern = "*.zip")
l = lapply(s2_expand_zip(lst), read_stars_meta, sub = 1)
bb <- do.call(c, lapply(l, function(x) st_as_sfc(st_bbox(st_dimensions(x)))))
fn = sapply(l, function(x) attr(x, "file_names"))
md = st_sf(tibble(file = fn, geom = bb))

# global database
data = list(md = md)

Serving /data

The following function simply logs every request to screen and forwards it:

#* Log some information about the incoming request
#* @filter logger
function(req){
  cat(as.character(Sys.time()), "-", 
    req$REQUEST_METHOD, req$PATH_INFO, "-", 
    req$HTTP_USER_AGENT, "@", req$REMOTE_ADDR, ":", req$postBody, "\n")
  plumber::forward()
}

The final two functions do the real work: they read an expression, evaluate it and return the result (get), or they store the result on the server side (put):

# plumber REST end points /data:
#* @get  /data
#* @post /data
get_data <- function(req, expr = NULL) {
    print(expr)
    if (is.null(expr))
        names(data)
    else 
        base64_enc(serialize( eval(parse(text = expr), data), NULL)) # to char
}

#* @put /data
put_data <- function(req, name, value) {
    data[[name]] <<- unserialize(base64_dec(fromJSON(value)))
    NULL
}

This means that GET http://x.y/data gives a listing of available datasets, and GET http://x.y/data/obj retrieves object obj. PUT uploads objects to the database (called data).

Client

On the client side, we need to also do some work:

I/O with the server:

library(httr)     # GET, POST, PUT
library(jsonlite) # base64_dec, base64_enc, toJSON, fromJSON
## Loading required package: methods

get_data = function(url, expr = NULL) {
    if (is.null(expr))
        fromJSON( content(GET(url), "text", encoding = "UTF-8"))
    else
        unserialize(base64_dec(fromJSON( 
            content(POST(url, body = list(expr = expr), encode = "json"),
                "text", encoding = "UTF-8")
        )))
}

put_data = function(url, name, value) {
    value = toJSON(base64_enc(serialize(value, NULL)))
    PUT(url, body = list(name = name, value = value), encode = "json")
}

Somewhat confusingly I gave these the same name as on the server side, but these exist in a different R session: the one you interact with directly, the client. For get_data, the return value is decoded and unserialized to get a copy of the original R object on the server. This R object may have been present at the server, or may have been created on the server: since the server uses eval, it takes arbitrary R expressions.

Example client session: metadata

We can get a view of the data present on the server by

url = "http://localhost:8000/data"
get_data(url) # list
## [1] "md"

Next, we can retrieve the md (metadata) object:

library(stars)
## Loading required package: abind
## Loading required package: sf
## Linking to GEOS 3.5.1, GDAL 2.2.1, proj.4 4.9.3
library(tibble)   # print
md = get_data(url, "md")
md
## Simple feature collection with 36 features and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 3e+05 ymin: 5590200 xmax: 809760 ymax: 6e+06
## epsg (SRID):    32631
## proj4string:    +proj=utm +zone=31 +datum=WGS84 +units=m +no_defs
## # A tibble: 36 x 2
##    file                                                               geom
##    <chr>                                                     <POLYGON [m]>
##  1 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((499980 5890200, 609780 5890200,…
##  2 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5890200, 709800 5890200, …
##  3 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((699960 5690220, 809760 5690220,…
##  4 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((3e+05 5690220, 409800 5690220, …
##  5 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((499980 5590200, 609780 5590200,…
##  6 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5590200, 709800 5590200, …
##  7 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5690220, 709800 5690220, …
##  8 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5790240, 709800 5790240, …
##  9 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5890200, 709800 5890200, …
## 10 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((699960 5590200, 809760 5590200,…
## # ... with 26 more rows

Query data

We will now create an RGB mosaic of the data, while reading them at overview level 3, meaning 160 m x 160 m pixels.

# select a country:
nl = st_as_sf(raster::getData("GADM", country = "NLD", level = 0)) %>%
    st_transform(st_crs(md))

file = md[nl,]$file
plot(st_geometry(nl), axes = TRUE)
s = sapply(file, function(x) {
    expr = paste0("read_stars(\"", x, "\", options = \"OVERVIEW_LEVEL=3\", NA_value = 0)")
    r = get_data(url, expr)
    image(r, add = TRUE, rgb = 4:2, maxColorValue = 15000)
})
plot(st_geometry(md), border = 'yellow', add = TRUE, lwd = .8)
plot(st_geometry(nl), add=TRUE, border = 'orange', lwd = 2)

The expression file = md[nl,]$file selects the file names from the image metadata md for those scenes (images) that intersect with the country (nl).

On top of the image (which shows mostly clouds) we plotted the country boundary, and the tile boundaries.

Process: `st_apply`, NDVI

This way, we can also create new variables on the server side and show them. As an example, we can use st_apply which does an apply to the arrays in a stars object (and takes care of the dimensions afterwards!), and define an NDVI function, which we upload to the server:

ndvi = function(x) (x[4]-x[1])/(x[4]+x[1])
put_data(url, "ndvi", ndvi)
## Response [http://localhost:8000/data]
##   Date: 2018-03-23 03:01
##   Status: 200
##   Content-Type: application/json
##   Size: 2 B
get_data(url)
## [1] "md"   "ndvi"
get_data(url, "ndvi")
## function(x) (x[4]-x[1])/(x[4]+x[1])

Next, we can query the server giving it an expression that applies the ndvi function to the (overview level 3) imagery, and returns the single-band (NDVI) image:

plot(st_geometry(nl), axes = TRUE)
s = sapply(file, function(x) {
    expr = paste0("st_apply(read_stars(\"", x, "\", options = \"OVERVIEW_LEVEL=3\", NA_value = 0),
        1:2, ndvi)")
    r = get_data(url, expr)
    image(r, add = TRUE, zlim = c(-1,1), col = colorRampPalette(c(grey(.1), grey(.5), 'green'))(10))
})
plot(st_geometry(nl), add = TRUE) # once more

The diagonal line is caused by the fact that we combine imagery from different satellite overpasses, of which most are pretty cloudy.

Computing NDVI is just one example; the Index database contains 200 more indexes that can be computed from Sentinel 2 data.

Limitations and lessons learned

As a prototype, this implementation has strong limitations. To use this in an architecture with multiple, concurrent users would require authentication, user management, and concurrent R sessions.
if we try to read one full 10 m scene with 4 bands from the server, we see the error long vectors not supported, indicating that the encoded data is over 2 Gb; file download and upload will be needed for larger data sizes, or more clever interaction patterns such as a web map service or tile map service.
to pass R objects, we need to serialize them, base64_enc that, and wrap the result in toJSON. On the receiver side, we need to do the reverse (fromJSON, base_64_dec, unserialize).
never try to put an R expression or its base64 encoding in a url (using GET) – use POST and put it in the message body; GET will do funny things with funny characters (even =).
the way we submit expressions looks pretty messy using character strings, there must be a more tidy way to do this.

Try this at home

No blog without a reproducible example! Start two R sessions, one for a client, one for a server. On the server sessions, run:

devtools::install_github("stars") # requires sf 0.6-1, which is now on CRAN
library(plumber)
r <- plumb(system.file("plumber/server.R", package = "stars"))
r$run(port=8000)

and keep this running! On the client session, run

source(system.file("plumber/client.R", package = "stars"),, echo = TRUE)

and watch for the logs from the server. In the client, you can now

plot(xx)

to see an object that was read in the server session.

The way forward

Where this example worked locally, and thus automatically with “small” large data (20 Gb imagery), in next steps we will develop and deploy this with “large” large data. Amazon hosts large collections of Landsat 8 and Sentinel 2 L1C data, and with the setup sketched above it should be fairly straightforward to directly compute on these data, given access to a cloud back-end machine. The R Consortium has sponsored the Earth data processing backend for testing and evaluating stars project, which has a budget for a storage and/or compute back-end for evaluation purposes.

openEO is a project that develops an open API to connect R, python and javascript clients to big Earth observation cloud back-ends in a simple and unified way. That is a similar ambition, but larger. openEO uses a process graph (directed, acyclic graph) to represent the expression, which must be acceptable for all back-ends. Here, we have R on both sides of the wire, which makes life much more simple. But openEO is also a cool project: check out the proof of concept videos!

Earlier stars blogs

first stars blog
second stars blog

To leave a comment for the author, please follow the link and comment on their blog: r-spatial.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.