1 ESRI Satellite Images

We start off with the satellite images over the Paul Scherrer Institute in Villigen, Switzerland (acquired from ESRI ArcGIS Online)

1.1 Load the data in

Once the Spark Cluster has been created and you have the SparkContext called sc (automatically provided in Databricks Cloud or Zeppelin), the data can be loaded using the Spark Image Layer. The command readTiledImage loads in the data as tiles and can read mega-, giga-, even petabytes of data from Amazon’s S3, Hadoop Filesystem (HDFS), or any shared / network filesystem.

val psiBingImage = 
    sc.readTiledImage[Double]("s3n://geo-images/esri-satimg/psi*.png",256,256).cache

Although we execute the command one one machine, the data will be evenly loaded over all of the machines in the cluster (or cloud). The cache suffix keeps the files in memory so they can be read faster as many of our image processing tasks access the images a number of times.

1.2 Finding Reflective Objects

As our first task we want to identify all of the reflective objects, for this we - apply a threshold to the average intensity (\(\frac{red+blue+green}{3}\)) - identify the distinct regions - analyze the area and perimeter of the regions

1.2.1 Threshold / Segmentation

// Segment out all of the non-zero points
val pointImage = psiBingImage.sparseThresh(_.intensity>200).cache

1.2.2 Identify Regions

Apply component labeling and then filter to results to only keep the very large objects (the small ones are noise / not important)

// Label each chunk using connected component labeling with a 3 x 3 window
val uniqueRegions = ConnectedComponents.
  Labeling2DChunk(pointImage).filter(_.area>400)

1.2.3 Shape Analysis (Area and Perimeter)

// Run a shape analysis to measure the position, area, and intensity of each identified region
val shapeAnalysis = EllipsoidAnalysis.runIntensityShapeAnalysis(uniqueRegions)

1.3 Estimating Tree Coverage

To estimate the tree coverage is a slightly more difficult problem, but involves fundamentally the same types of analysis. For this we will use a different threshold criteria for identifying the trees and then apply some morphological operations to group nearby objects together.

1.3.1 Threshold / Segmentation

We identify the tree regions using a fairly simple rule with two criteria

  1. \(Green\gt\frac{Red+Blue}{2}\)
  2. \(Green\lt50\)

We also see that the code which we write, although it is parallel and running over a whole cluster of computers, mirrors the math nicely and contains none of the details of cluster or data management.

// Segment out all of points which meet the following criteria
val treeImage = psiBingImage.sparseThresh{
  pixVal =>
    // the green is brighter than the average of the red and the blue
    (pixVal.green>(pixVal.red+pixVal.blue)/2) &
    // the green value itself is dark (trees do not reflect much)
    (pixVal.green<50) 
  }.cache

1.3.2 Connecting nearby groups

We use the information in the neighborhood of each pixel to connect groups of nearby pixels together (two small adjacent clumps become one). This operation is known in image processing as a Morphological Close

// Perform 3 closing operations to connect the nearby tree regions together
val treeGroupImage = Morphology.close(treeImage,3) 

1.3.3 Identifying Regions

We apply component labeling and then filter to results to only keep the middle sized objects (too small are just artifacts or noise, too large is the river and other dark, green objects)

// Label each region using connected component labeling with a 3 x 3 window
val treeRegions = ConnectedComponents.
  Labeling2DChunk(treeGroupImage).filter{
  tRegion =>
    // we now remove the small single dark regions
    tRegion.area>1000 & 
    // since some of the river is classified as 'tree' as well we can remove all very large objects
    tRegion.area<500000
  }

1.3.4 Shape Analysis (Area, Perimeter, Concavity)

Now we can calculate the shape information for the tree areas to look at some of the statistics

// Run a shape analysis to measure the position, area, and intensity of each identified region
val shapeAnalysis = EllipsoidAnalysis.runIntensityShapeAnalysis(treeRegions)

We can then place the statistics back onto the map (for the largest ones)

1.4 Density Plots

Additionally metrics can be calculated like tree density and displayed on their own

Or projected back on top of the original data and a standard map

2 Acknowledgements

Analysis powered by Spark Image Layer from 4Quant, Visualizations, Document Generation, and Maps provided by:

To cite ggplot2 in publications, please use:

H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.

A BibTeX entry for LaTeX users is

@Book{, author = {Hadley Wickham}, title = {ggplot2: elegant graphics for data analysis}, publisher = {Springer New York}, year = {2009}, isbn = {978-0-387-98140-6}, url = {http://had.co.nz/ggplot2/book}, }

To cite package ‘leaflet’ in publications use:

Joe Cheng and Yihui Xie (2014). leaflet: Create Interactive Web Maps with the JavaScript LeafLet Library. R package version 0.0.11. https://github.com/rstudio/leaflet

A BibTeX entry for LaTeX users is

@Manual{, title = {leaflet: Create Interactive Web Maps with the JavaScript LeafLet Library}, author = {Joe Cheng and Yihui Xie}, year = {2014}, note = {R package version 0.0.11}, url = {https://github.com/rstudio/leaflet}, }

To cite plyr in publications use:

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. URL http://www.jstatsoft.org/v40/i01/.

A BibTeX entry for LaTeX users is

@Article{, title = {The Split-Apply-Combine Strategy for Data Analysis}, author = {Hadley Wickham}, journal = {Journal of Statistical Software}, year = {2011}, volume = {40}, number = {1}, pages = {1–29}, url = {http://www.jstatsoft.org/v40/i01/}, }