
Geospatial distributed processing with furrr

Written by Ryan Garnett
2023-01-23
Hex stickers from the tidyverse, sf, gt, future, and furrr package on top of a map of the Nova Scotia region in Canada.

This is a guest post from Ryan Garnett, Data Management Insights & Analytics Manager at Green Shield Canada. Ryan works to bridge technical teams and senior executives by effectively communicating complex transformational change, achieved through international experience in technology development, operational management, client service, and strategic development. Keep in touch on LinkedIn.

The challenge

Everything happens somewhere. Location is an important factor in many of life's decisions, from buying a house or choosing a school to planning a vacation or picking a restaurant. Within the data space, location data is becoming increasingly important in data analysis, data science, and machine learning. Data wrangling is an integral part of data science, typically occupying a significant portion of time and effort, and location data adds to that complexity.

Location data can be large, with a single dataset containing hundreds of thousands to billions of observations. Because location-based analysis routinely combines multiple datasets, an analysis can take a long time to complete. As with other data analytics, the majority of location-based analysis tasks are performed sequentially on a single processor, even though many of the operations have no dependency on the other data or their outcomes. If only there were another way… well, let me introduce you to distributed location-based processing.

What is distributed processing? In short, distributed processing is the use of multiple computer processors to execute a single computational task. A distributed process is executed by splitting a task into smaller chunks which are simultaneously performed on multiple processors. We can use distributed processing to make our code run faster, sometimes much faster. For large geospatial datasets that require a lot of computational power, distributed processing can reduce the time needed to handle complex data manipulation.

In this post, we’ll explore different techniques for wrangling location data, the processing time required for each, and the significant improvements that can be seen with distributed processing.
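As a minimal sketch of the idea (toy data only, not the datasets used later in this post), moving from sequential {purrr} code to distributed {furrr} code is often a one-line change:

```r
library(furrr)
library(future)
library(purrr)

slow_square <- function(x) {
  Sys.sleep(0.5)  # stand-in for an expensive geospatial operation
  x^2
}

# Sequential: one process works through all eight elements in turn
purrr::map_dbl(1:8, slow_square)

# Distributed: four worker processes each take a share of the elements
future::plan(future::multisession, workers = 4)
furrr::future_map_dbl(1:8, slow_square)

# Return to sequential processing when done
future::plan(future::sequential)
```

With eight half-second tasks, the sequential call takes roughly four seconds, while the four-worker call takes roughly one second plus some overhead for launching the workers.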

Setup

We performed the analysis for this post on a System76 Ubuntu laptop with the following configuration:

Operating System    Ubuntu 22.04
Cores               8
CPU speed           1.80GHz
RAM                 32GB

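Your timings will vary with your hardware. To see what your own machine offers, base R's parallel package and {future} both report core counts:

```r
# Logical cores visible to R (base R's parallel package)
parallel::detectCores()

# Cores {future} considers available for worker processes
future::availableCores()
```

The two numbers can differ: availableCores() respects settings such as container CPU limits and scheduler environment variables, so it is the safer value to use when choosing a worker count.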
Load packages

We use a number of packages in this post, some standard data analytics packages ({dplyr}, {janitor}, {ggplot2}, {gt}, and {readr}), as well as a few niche packages ({furrr}, {future}, {leaflet}, and {sf}).

  • The goal of {furrr} is to combine {purrr}’s family of mapping functions with {future}’s parallel processing capabilities: {furrr} R package

  • The {future} package provides unified parallel and distributed processing in R for everyone: {future} R package

  • Leaflet is one of the most popular open-source JavaScript libraries for interactive maps: {leaflet} R package

  • The {sf} package provides access, data wrangling, and data analysis for location data: {sf} R package

# Importing data
library(readr)

# Data analysis and wrangling
library(dplyr)
library(janitor)
library(sf)

# Distributed processing
library(furrr)
library(future)

# Visualization and styling
library(ggplot2)
library(gt)
library(leaflet)

Helper functions

We will use location data from different sources with different attribute standards. To assist with the data wrangling, we create custom functions to help standardize attributes available within the source datasets.

# Add provincial abbreviation column from provincial name
create_prov_abbrv_from_name <- function(.data, prov_column){
  .data %>%
    dplyr::mutate(PROV = dplyr::case_when(
      {{prov_column}} == "Alberta" ~ "AB",
      {{prov_column}} == "British Columbia" ~ "BC",
      {{prov_column}} == "Manitoba" ~ "MB",
      {{prov_column}} == "New Brunswick" ~ "NB",
      {{prov_column}} == "Newfoundland and Labrador" ~ "NL",
      {{prov_column}} == "Northwest Territories" ~ "NT",
      {{prov_column}} == "Nova Scotia" ~ "NS",
      {{prov_column}} == "Nunavut" ~ "NU",
      {{prov_column}} == "Ontario" ~ "ON",
      {{prov_column}} == "Prince Edward Island" ~ "PE",
      {{prov_column}} == "Quebec" ~ "QC",
      {{prov_column}} == "Saskatchewan" ~ "SK",
      {{prov_column}} == "Undersea Feature" ~ "Non Province",
      {{prov_column}} == "Yukon" ~ "YK",
      TRUE ~ "Non Canadian"
    ))
}

# Add provincial abbreviation column from provincial id value
create_prov_abbrv_from_census_num <- function(.data, column_name){
  .data %>%
    dplyr::mutate(PROV = dplyr::case_when(
      {{column_name}} == 10 ~ "NL",
      {{column_name}} == 11 ~ "PE",
      {{column_name}} == 12 ~ "NS",
      {{column_name}} == 13 ~ "NB",
      {{column_name}} == 24 ~ "QC",
      {{column_name}} == 35 ~ "ON",
      {{column_name}} == 46 ~ "MB",
      {{column_name}} == 47 ~ "SK",
      {{column_name}} == 48 ~ "AB",
      {{column_name}} == 59 ~ "BC",
      {{column_name}} == 60 ~ "YK",
      {{column_name}} == 61 ~ "NT",
      {{column_name}} == 62 ~ "NU",
      TRUE ~ "Non Canadian"
    ))
}

# Change case of sf geometry column
change_sf_geometry_case <- function(df){
  names(df)[names(df) == "geometry"] <- "GEOMETRY" 
  attr(df, "sf_column") <- "GEOMETRY"
  df
}
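Applied to a small toy tibble (the data here is made up for illustration, and assumes the helpers above are loaded), the first helper adds a PROV column like so:

```r
library(dplyr)

toy <- dplyr::tibble(
  PLACE = c("Halifax", "Toronto", "Boston"),
  PROV_NAME = c("Nova Scotia", "Ontario", "Massachusetts")
)

toy %>%
  create_prov_abbrv_from_name(PROV_NAME)
# PROV is "NS", "ON", and "Non Canadian" respectively
```

The {{prov_column}} embracing inside the helper is what lets us pass PROV_NAME as a bare column name rather than a string.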

Source data

Location data, also known as geospatial data, is a specific type of data that stores geographic coordinates for use in analysis and visualizations. There are two main types of location data: vector and raster. This blog post will leverage vector data, which consists of points, lines, or polygons.

  • Points A unique location represented by X and Y coordinates, such as:

    • address locations
    • place names
    • points of interest (POI)
  • Lines A series of point locations connected in a linear path, such as:

    • rivers
    • roads
    • trails
  • Polygons A set of enclosed, connected points representing a defined area, such as:

    • countries
    • land use
    • water bodies

For this blog, we will use point and polygon vector data, specifically Canadian postal code (points) and Census Canada economic regions (polygons). For more information on the data sources, see the two links below.

# Open source Canadian postal codes
# Download, import, and transform data
can_postal_codes <-
  readr::read_csv(
    "https://raw.githubusercontent.com/ccnixon/postalcodes/master/CanadianPostalCodes.csv"
  ) %>%
  janitor::clean_names("all_caps") %>%
  create_prov_abbrv_from_census_num(FSA_PROVINCE) %>%
  dplyr::select(FSA,
                POSTAL_CODE,
                PLACE_NAME,
                PROV,
                AREA_TYPE,
                LATITUDE,
                LONGITUDE) %>%
  sf::st_as_sf(coords = c('LONGITUDE', 'LATITUDE'),
               crs = 4326) %>%
  sf::st_make_valid() %>%
  change_sf_geometry_case()

# Statistics Canada economic regions data
# Download, import, and transform source zip file data
download.file(
  "https://www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/ler_000b21a_e.zip",
  destfile = "data/ler_000b21a_e.zip"
)

# Unzip file
unzip("data/ler_000b21a_e.zip", exdir = "data")

# Import and transform data
can_eco_regions <- sf::st_read("data/ler_000b21a_e.shp") %>%
  janitor::clean_names("all_caps") %>%
  create_prov_abbrv_from_census_num(PRUID) %>%
  sf::st_transform(crs = 4326) %>%
  sf::st_make_valid() %>%
  dplyr::select(ERUID, ERNAME, PROV) %>%
  change_sf_geometry_case()

Data exploration

We will use two datasets for this analysis: Canadian postal code points and Canadian Census economic region polygons. Both datasets have location information and have undergone data wrangling to standardize their attributes. Let’s take a quick look at the volume of each dataset.

Dataset Number of Columns Number of Rows
Postal codes 6 889,291
Economic regions 4 76

The postal code point dataset is by far the larger of the two, with points distributed across Canada. The following table outlines the number of postal code points in each province and territory.

Province Number of Points
AB 86,977
BC 132,827
MB 27,833
NB 59,563
NL 11,344
NS 26,924
NT 604
NU 21
ON 295,166
PE 3,487
QC 218,745
SK 24,711
YK 1,089
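A per-province summary like the one above can be produced with {dplyr} — a sketch, assuming the can_postal_codes object built earlier; dropping the geometry column first keeps the count fast:

```r
can_postal_codes %>%
  sf::st_drop_geometry() %>%
  dplyr::count(PROV, sort = TRUE)
```

st_drop_geometry() converts the sf object to a plain data frame, so dplyr::count() doesn't carry the point geometries through the grouping.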

While the volume of data in the polygon dataset is significantly smaller compared to the points layer, it should be noted that each polygon covers a large geographic area. The number of economic regions ranges from 1 to 17, depending on the province or territory.

Province Number of Polygons
AB 8
BC 8
MB 8
NB 5
NL 4
NS 5
NT 1
NU 1
ON 11
PE 1
QC 17
SK 6
YK 1

The postal code points are not evenly distributed across the economic regions. Areas with larger populations and more urban development have a greater number of postal codes, as illustrated on the map of Nova Scotia, where points are heavily concentrated in two main areas.
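A map like that can be sketched with {leaflet}, filtering both datasets to Nova Scotia; the original post's exact styling is not shown here, so treat the options below as illustrative:

```r
library(dplyr)
library(leaflet)
library(sf)

# Filter both layers to Nova Scotia
ns_regions <- dplyr::filter(can_eco_regions, PROV == "NS")
ns_points  <- dplyr::filter(can_postal_codes, PROV == "NS")

# Economic region polygons under postal code points
leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data = ns_regions, weight = 1, fillOpacity = 0.1) %>%
  addCircleMarkers(data = ns_points, radius = 1, stroke = FALSE)
```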