
Color coding your data in {gt} 0.9.0

Written by Rich Iannone
2023-05-17
Text: "Color coding your data, gt 0.9.0" and the gt hex sticker on the left, a table with cells of different colors on the right.

There are many improvements in the new 0.9.0 release of gt! In fact, there is so much that is new that we couldn’t fit it all in a single blog post. This blog post (number three in a larger series on gt 0.9.0) focuses on the improvements to data_color(), a function that lets you perform data cell colorization.

A basic example of how to use data_color()

Let’s introduce the data_color() function with a simple example. For the sake of simplicity, let’s use gt’s exibble dataset for this:

exibble |>
  gt() |>
  data_color()
num char fctr date time datetime currency row group
1.111e-01 apricot one 2015-01-15 13:35 2018-01-01 02:22 49.950 row_1 grp_a
2.222e+00 banana two 2015-02-15 14:40 2018-02-02 14:33 17.950 row_2 grp_a
3.333e+01 coconut three 2015-03-15 15:45 2018-03-03 03:44 1.390 row_3 grp_a
4.444e+02 durian four 2015-04-15 16:50 2018-04-04 15:55 65100.000 row_4 grp_a
5.550e+03 NA five 2015-05-15 17:55 2018-05-05 04:00 1325.810 row_5 grp_b
NA fig six 2015-06-15 NA 2018-06-06 16:11 13.255 row_6 grp_b
7.770e+05 grapefruit seven NA 19:10 2018-07-07 05:22 NA row_7 grp_b
8.880e+06 honeydew eight 2015-08-15 20:20 NA 0.440 row_8 grp_b

What’s happened is that data_color() applied background colors to all cells of every column using R’s default palette (internally accessed through the grDevices::palette() function). The default method for applying color is "auto", which is set through the new method argument. With method = "auto", gt will decide on a column-by-column basis which colorization method to use. For numeric values, the method will be "numeric"; for character or factor values, the "factor" method is chosen. (We’ll get more into the various color computation methods a bit later in the post.)
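To make the "auto" rule a bit more concrete, here is a rough sketch of the per-column decision described above. It is not gt’s internal code: the pick_method() helper is hypothetical, and it only covers the column types mentioned in the text.

```r
# Hypothetical helper (not part of gt): mimic the described "auto" choice
# for a single column.
pick_method <- function(x) {
  if (is.numeric(x)) {
    "numeric"
  } else if (is.character(x) || is.factor(x)) {
    "factor"
  } else {
    NA_character_  # other column types are outside the scope of this sketch
  }
}

# A tiny stand-in for a few exibble-like columns
df <- data.frame(
  num = c(0.1111, 2.222, 33.33),
  char = c("apricot", "banana", "coconut"),
  fctr = factor(c("one", "two", "three")),
  stringsAsFactors = FALSE
)

# One decision per column
methods <- vapply(df, pick_method, character(1))
print(methods)
```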

An interesting thing about data_color() in gt 0.9.0 is that it works without having to supply any argument values! Previously, you needed to provide something for columns and a color-mapping function to colors. This made the function very difficult to use without first looking at a working example. We think that the new interface that prioritizes choosing a method will be better for most users (and you can still use a color-mapping function with the new fn argument).

Choosing a palette

Virtually nobody will want to rely on the default palette, so let’s take a look at some of the color-specification possibilities available in the new palette argument. It can take any of the following types of inputs:

  1. a vector of color names
  2. the name of an RColorBrewer palette
  3. the name of a viridis palette (e.g., "viridis", "magma", etc.)
  4. a discrete palette accessible from the paletteer package using the <package>::<palette> syntax

Let’s try each of these with four separate calls of data_color() on a simple table:

dplyr::tibble(red_green = 1:10, brewer = 1:10, viridis = 1:10, zissou = 1:10) |>
  gt() |>
  data_color(columns = red_green, palette = c("red", "green")) |>
  data_color(columns = brewer, palette = "Oranges") |>
  data_color(columns = viridis, palette = "viridis") |>
  data_color(columns = zissou, palette = "wesanderson::Zissou1") |>
  cols_width(everything() ~ px(100))
red_green brewer viridis zissou
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10

Notice how in the first column (red_green), there is interpolation between "red" (value 1) and "green" (value 10). The palette’s colors will be distributed evenly in the range of data available. This is the default behavior, and the range can be set with the domain argument. We can experiment with that using a new table:
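The even spread of palette colors over the data range can be approximated in base R. This is only a sketch: it assumes plain linear interpolation in RGB space through grDevices::colorRamp(), whereas gt may interpolate in a different color space, so the intermediate colors could differ from what the table shows.

```r
# Build a ramp between the two palette colors
ramp <- grDevices::colorRamp(c("red", "green"))

values <- 1:10

# Rescale the data to [0, 1] using its own range (the default described above)
scaled <- (values - min(values)) / (max(values) - min(values))

# Look up one RGB triple per value and convert to hex strings
rgb_mat <- ramp(scaled)
hex <- grDevices::rgb(rgb_mat, maxColorValue = 255)

# The smallest value gets pure red, the largest gets pure green
print(hex[c(1, 10)])
```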

dplyr::tibble(values = 1:10) |>
  gt() |>
  data_color(
    palette = c("red", "green"),
    domain = 3:7
  )
Warning: Some values were outside the color scale and will be treated as NA
values
1
2
3
4
5
6
7
8
9
10

When constraining the domain like this, any values that are outside of it are treated as NA (we even get a warning about it) and given a gray color reserved for NA values. We can use the na_color argument to provide a custom color if "#808080" isn’t suitable.
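The out-of-domain behavior can be sketched in a few lines of base R. The color_in_domain() helper below is hypothetical and only imitates the behavior described above (values outside the domain receive the NA color); it is not how gt implements it.

```r
# Hypothetical helper: map values to colors, treating anything outside
# `domain` like NA and giving it `na_color`.
color_in_domain <- function(x, palette, domain, na_color = "#808080") {
  lo <- min(domain)
  hi <- max(domain)
  scaled <- (x - lo) / (hi - lo)

  # Flag missing values and values outside the domain
  outside <- is.na(scaled) | scaled < 0 | scaled > 1
  scaled[outside] <- 0  # placeholder so colorRamp() receives valid input

  rgb_mat <- grDevices::colorRamp(palette)(scaled)
  out <- grDevices::rgb(rgb_mat, maxColorValue = 255)
  out[outside] <- na_color
  out
}

cols <- color_in_domain(1:10, palette = c("red", "green"), domain = 3:7)
print(cols)
```

Only the values 3 through 7 receive palette colors here; the rest fall back to the gray NA color.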

dplyr::tibble(values = 1:10) |>
  gt() |>
  data_color(
    palette = c("red", "green"),
    domain = 3:7,
    na_color = "steelblue"
  )
Warning: Some values were outside the color scale and will be treated as NA
values
1
2
3
4
5
6
7
8
9
10

We should only provide a single color to na_color, but it’s worth noting that any color you supply can be a color name (R/X11 or CSS 3.0) or a hexadecimal string in the form of "#RRGGBB" or "#RRGGBBAA".
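If you’d like to check how these color formats relate to each other, base R’s grDevices::col2rgb() can decode them. Note that this is a base R function, not part of gt, and it understands R/X11 names and hex strings; CSS-only names may not be recognized by it.

```r
# The same color written three ways: an R/X11 name, "#RRGGBB",
# and "#RRGGBBAA" (with a half-transparent alpha channel)
colors <- c("steelblue", "#4682B4", "#4682B480")

# Decode to red/green/blue/alpha components (one column per color)
rgba <- grDevices::col2rgb(colors, alpha = TRUE)
print(rgba)
```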

Color mapping methods

The previous uses of data_color() all used the "numeric" method of color mapping. Let’s take a look at the different methods and how you would use them. It’s instructive to use examples, so here’s one that uses all four method types:

dplyr::tibble(
  numeric = 1:10,
  bin = 1:10,
  quantile = 1:10,
  factor = vec_fmt_spelled_num(c(1:5, 1:5))
) |>
  gt() |>
  data_color(
    columns = numeric,
    method = "numeric",
    palette = "viridis"
  ) |>
  data_color(
    columns = bin,
    method = "bin",
    palette = "viridis",
    bins = c(1, 5, 7, 10)
  ) |>
  data_color(
    columns = quantile,
    method = "quantile",
    palette = "viridis",
    quantiles = 5
  ) |>
  data_color(
    columns = factor,
    method = "factor",
    palette = "viridis",
    levels = vec_fmt_spelled_num(1:5)
  ) |>
  cols_width(everything() ~ px(100))
numeric bin quantile factor
1 1 1 one
2 2 2 two
3 3 3 three
4 4 4 four
5 5 5 five
6 6 6 one
7 7 7 two
8 8 8 three
9 9 9 four
10 10 10 five

The first three columns use numbers from 1 to 10, and the different methods ("numeric", "bin", and "quantile") allow us to easily generate a color-mapping function with a few supporting arguments.

In the first column, using method = "numeric" creates a smooth ramp of colors across the "viridis" palette. The second column has the "bin" method applied, with the bin boundaries supplied through the bins argument. The "quantile" method used in the third column subdivides the values into bins that each hold a roughly equal share of the values; the number of bins is set through the quantiles argument. Finally, the "factor" method is best used for text-based values, as seen in the fourth column (though any type is valid). Factor levels are, by default, alphabetical, but the supporting levels argument lets you specify them directly.
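To see which values end up sharing a color under the "bin" and "quantile" methods, we can reproduce the grouping with base R’s cut() and quantile(). This is a sketch under an assumption: we treat the intervals as closed on the left (right = FALSE), which may not match gt’s behavior exactly at the bin boundaries.

```r
values <- 1:10

# "bin": explicit boundaries, as in bins = c(1, 5, 7, 10)
bin_id <- cut(
  values,
  breaks = c(1, 5, 7, 10),
  labels = FALSE,
  include.lowest = TRUE,
  right = FALSE  # assumption: intervals are [a, b)
)
print(tabulate(bin_id, nbins = 3))  # 4 2 4

# "quantile": five bins whose boundaries come from the data itself
q_breaks <- quantile(values, probs = seq(0, 1, length.out = 6), names = FALSE)
q_id <- cut(
  values,
  breaks = q_breaks,
  labels = FALSE,
  include.lowest = TRUE,
  right = FALSE
)
print(tabulate(q_id, nbins = 5))  # 2 2 2 2 2
```

With explicit bins the groups can be uneven (4, 2, and 4 values here), while the quantile approach puts two values in each of the five groups.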

Before gt 0.9.0, you were required to supply your own color-mapping function. This is still possible with the fn argument. Here’s an example of that using the col_numeric() function from the scales package:

countrypops |>
  dplyr::filter(country_name == "Mongolia") |>
  dplyr::select(-contains("code")) |>
  tail(10) |>
  gt() |>
  fmt_integer(columns = population) |>
  data_color(
    columns = population,
    fn = scales::col_numeric(
      palette = "viridis",
      domain = c(2.5E6, 3.4E6)
    )
  )
Warning: Some values were outside the color scale and will be treated as NA
country_name year population
Mongolia 2015 3,026,864
Mongolia 2016 3,088,856
Mongolia 2017 3,148,917
Mongolia 2018 3,208,189
Mongolia 2019 3,267,673
Mongolia 2020 3,327,204
Mongolia 2021 3,383,741
Mongolia 2022 3,433,748
Mongolia 2023 3,481,145
Mongolia 2024 3,524,788

If you’re not familiar with the color-mapping functions available in the scales package, just know that invoking col_numeric() will return a function (which is what the fn argument actually requires) that takes a vector of numeric values and returns color values.
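Here is a quick way to convince yourself that col_numeric() hands back a function. The sketch assumes the scales package is installed; the two population values are taken from the Mongolia table above.

```r
# col_numeric() does not return colors; it returns a function that makes them
pal_fn <- scales::col_numeric(
  palette = "viridis",
  domain = c(2.5E6, 3.4E6)
)
print(is.function(pal_fn))

# Call the returned function on a couple of population values; the second
# one lies outside the domain, so a warning is raised and the NA color is used
cols <- suppressWarnings(pal_fn(c(3026864, 3524788)))
print(cols)
```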

Using scales-based functions in fn can be very useful if you want to make use of the specialized arguments available in the col_*() functions. You could even supply your own custom function for performing more complex colorizing treatments!
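As a small illustration of the custom-function idea, here is a hypothetical threshold-based colorizer. We are assuming, based on the description above, that fn accepts any function that takes a vector of values and returns one color per value; the threshold_colors() name and the 3.2 million cutoff are made up for this sketch.

```r
# Hypothetical custom color-mapping function: one color per input value
threshold_colors <- function(x) {
  ifelse(
    is.na(x),
    "gray",
    ifelse(x >= 3.2e6, "darkgreen", "lightgreen")
  )
}

# Try it on a few values (two from the Mongolia table, plus a missing one)
cols <- threshold_colors(c(3026864, 3267673, NA))
print(cols)
```

A function like this could then be handed to data_color() through fn = threshold_colors.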

Applying color to other columns

The data_color() function now lets you apply colorization indirectly to other columns. That is, you can apply colors to a column different from the one used to generate those specific colors. This can be done with the new target_columns argument. Let’s look at how it’s done with a countrypops-based table example.

countrypops |>
  dplyr::filter(country_code_3 %in% c("FRA", "GBR")) |>
  dplyr::filter(year %% 10 == 0) |>
  dplyr::select(-contains("code")) |>
  dplyr::mutate(color = "") |>
  gt(groupname_col = "country_name") |>
  fmt_integer(columns = population) |>
  data_color(
    columns = population,
    target_columns = color,
    method = "numeric",
    palette = "viridis",
    domain = c(4E7, 7E7)
  ) |>
  cols_width(year ~ px(60), population ~ px(120), color ~ px(10)) |>
  tab_options(column_labels.hidden = TRUE) |>
  opt_vertical_padding(scale = 0.65)
France
1960 47,412,964
1970 52,007,169
1980 55,274,184
1990 58,261,012
2000 60,918,661
2010 65,026,211
2020 67,601,110
United Kingdom
1960 52,400,000
1970 55,663,250
1980 56,314,216
1990 57,247,586
2000 58,892,514
2010 62,766,365
2020 66,744,000

So, the colors are based on the data in the population column, but the colors are actually placed in the color column (which was made intentionally ‘blank’ by setting it entirely with empty strings).

When specifying a single column in columns, we can use as many target_columns values as we want. Let’s make another table where we map the generated colors from the year column to all columns in the table. We’ll use the underrated "inferno" palette (from the "viridis" collection) for this one.

countrypops |>
  dplyr::filter(country_code_3 %in% c("FRA", "GBR", "ITA")) |>
  dplyr::select(-contains("code")) |>
  dplyr::filter(year %% 5 == 0) |>
  tidyr::pivot_wider(
    names_from = "country_name",
    values_from = "population"
  ) |>
  gt() |>
  fmt_integer(columns = c(everything(), -year)) |>
  data_color(
    columns = year,
    target_columns = everything(),
    palette = "inferno"
  ) |>
  cols_width(
    year ~ px(80),
    everything() ~ px(160)
  ) |>
  opt_all_caps() |>
  opt_horizontal_padding(scale = 3) |>
  opt_vertical_padding(scale = 0.75) |>
  tab_options(
    table_body.hlines.style = "none",
    column_labels.border.top.color = "black",
    column_labels.border.bottom.color = "black",
    table_body.border.bottom.color = "black"
  )
year France United Kingdom Italy
1960 47,412,964 52,400,000 50,199,700
1965 49,877,725 54,348,050 52,112,350
1970 52,007,169 55,663,250 53,821,850
1975 54,002,853 56,225,800 55,441,001
1980 55,274,184 56,314,216 56,433,883
1985 56,665,619 56,550,268 56,593,071
1990 58,261,012 57,247,586 56,719,240
1995 59,541,294 58,019,030 56,844,303
2000 60,918,661 58,892,514 56,942,108
2005 63,180,854 60,401,206 58,166,682
2010 65,026,211 62,766,365 59,819,407
2015 66,548,272 65,088,000 60,229,605
2020 67,601,110 66,744,000 59,438,851

Another interesting thing that can be done now in 0.9.0 is indirectly applying color in pairs. To do this, we make sure that the resolved number of columns in columns matches the number of columns in target_columns.

The towny dataset has columns with population values at different census years. It also has an associated set of columns that provide the percent change (as fractional values) across census years. In this next example, we will do the following things:

  1. perform color mapping on those change values (in columns)
  2. apply the colors indirectly to the population figures (with target_columns)
  3. hide the columns used to generate the color mappings (with cols_hide())
towny |>
  dplyr::filter(census_div %in% c("Oxford", "Essex")) |>
  dplyr::select(
    name, starts_with("population"), ends_with("pct"),
    -population_1996
  ) |>
  gt(rowname_col = "name") |>
  fmt_integer() |>
  data_color(
    columns = ends_with("pct"),
    target_columns = starts_with("population"),
    palette = c("red", "white", "green"),
    domain = c(-0.5, 0.5),
    na_color = "lightblue"
  ) |>
  cols_hide(columns = ends_with("pct")) |>
  cols_label_with(fn = function(x) gsub("population_", "", x)) |>
  opt_vertical_padding(scale = 0.6)
2001 2006 2011 2016 2021
Amherstburg 20,339 21,748 21,556 21,936 23,524
Blandford-Blenheim 7,630 7,149 7,359 7,399 7,565
East Zorra-Tavistock 7,238 7,008 6,836 7,113 7,841
Essex 20,085 20,032 19,600 20,427 21,216
Ingersoll 10,977 11,760 12,146 12,757 13,693
Kingsville 19,619 20,908 21,362 21,552 22,119
Lakeshore 28,746 33,245 34,546 36,611 40,410
LaSalle 25,285 27,652 28,643 30,180 32,721
Leamington 27,138 28,833 28,403 27,595 29,680
Norwich 10,478 10,481 10,721 10,835 11,151
Pelee 256 287 171 235 230
South-West Oxford 7,782 7,589 7,544 7,634 7,583
Tecumseh 25,105 24,224 23,610 23,229 23,300
Tillsonburg 14,052 14,822 15,301 15,872 18,615
Windsor 208,402 216,473 210,891 217,188 229,660
Woodstock 33,061 35,822 37,754 41,098 46,705
Zorra 8,052 8,125 8,058 8,138 8,628

We used a few more gt functions to clean up the table somewhat, but the bulk of the presentation lies in the use of data_color(). Because this is a fairly complex example, we recommend running the code in a statement-by-statement manner to see how each function call changes the output table.

An important note to make here is that the order of columns in both the columns and target_columns arguments should match the intended mapping order. That is the case in the above example, but other situations might vary (thus, it’s important to keep this in mind).
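The positional pairing can be pictured with a toy data frame. Everything in this sketch is hypothetical (the column names, the values, and the sign_color() rule); it only illustrates that the first entry in columns colors the first entry in target_columns, the second colors the second, and so on.

```r
# Toy data; the column names are made-up stand-ins, not towny's real names
toy <- data.frame(
  population_2001 = c(100, 200),
  population_2006 = c(110, 180),
  change_2001_pct = c(0.05, -0.02),
  change_2006_pct = c(0.10, -0.10)
)

columns <- c("change_2001_pct", "change_2006_pct")
target_columns <- c("population_2001", "population_2006")

# Pairing only makes sense when both sides have the same length
stopifnot(length(columns) == length(target_columns))

# A deliberately simple coloring rule, just for illustration
sign_color <- function(x) ifelse(x >= 0, "green", "red")

# Colors are computed from each source column and filed under the
# target column at the same position
cell_colors <- stats::setNames(
  lapply(columns, function(src) sign_color(toy[[src]])),
  target_columns
)
print(cell_colors)
```

Swapping the order of either vector would silently attach the colors to the wrong population column, which is why the ordering deserves attention.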

Row-wise color mapping

Colorization can now occur in a row-wise manner. The key to making that happen is using direction = "row". Let’s try this out using the sza dataset. After some very necessary dplyr and tidyr work, we’ll put that data into a gt table and apply color to values across each ‘month’ of data in that table. We won’t set a domain value and instead use the bounds of the data in each row.

sza |>
  dplyr::filter(latitude == 20 & tst <= "1200") |>
  dplyr::select(-latitude) |>
  dplyr::filter(!is.na(sza)) |>
  tidyr::pivot_wider(
    names_from = tst,
    values_from = sza,
    names_sort = TRUE
  ) |>
  gt(rowname_col = "month") |>
  sub_missing(missing_text = "") |>
  data_color(
    direction = "row",
    palette = "PuOr",
    na_color = "white"
  ) |>
  tab_options(table.font.size = px(12)) |>
  opt_vertical_padding(scale = 0.75)
    0530 0600 0630 0700 0730 0800 0830 0900 0930 1000 1030 1100 1130 1200
jan                84.9 78.7 72.7 66.1 61.5 56.5 52.1 48.3 45.5 43.6 43.0
feb           88.9 82.5 75.8 69.6 63.3 57.7 52.2 47.4 43.1 40.0 37.8 37.2
mar           85.7 78.8 72.0 65.2 58.6 52.3 46.2 40.5 35.5 31.4 28.6 27.7
apr      88.5 81.5 74.4 67.4 60.3 53.4 46.5 39.7 33.2 26.9 21.3 17.2 15.5
may      85.0 78.2 71.2 64.3 57.2 50.2 43.2 36.1 29.1 26.1 15.2  8.8  5.0
jun 89.2 82.7 76.0 69.3 62.5 55.7 48.8 41.9 35.0 28.1 21.1 14.2  7.3  2.0
jul 88.8 82.3 75.7 69.1 62.3 55.5 48.7 41.8 35.0 28.1 21.2 14.3  7.7  3.1
aug      83.8 77.1 70.2 63.3 56.4 49.4 42.4 35.4 28.3 21.3 14.3  7.3  1.9
sep      87.2 80.2 73.2 66.1 59.1 52.1 45.1 38.1 31.3 24.7 18.6 13.7 11.6
oct           84.1 77.1 70.2 63.3 56.5 49.9 43.5 37.5 32.0 27.4 24.3 23.1
nov           87.8 81.3 74.5 68.3 61.8 56.0 50.2 45.3 40.7 37.4 35.1 34.4
dec                84.3 78.0 71.8 66.1 60.5 55.6 50.9 47.2 44.2 42.4 41.8

When using direction = "row", we can see that each row has cell coloring that is relative to the range of values in the particular row. This is useful in those situations where you might feel the colorization should be made specific to the row.
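The difference between row-relative and table-wide scaling is easy to see numerically. The sketch below uses a handful of values from the table above (the 0700, 0930, and 1200 columns for jan and jun) and plain min-max rescaling, which is our simplification of how values get positioned along a palette.

```r
# A small slice of the table: three time columns for two months
m <- rbind(
  jan = c(84.9, 56.5, 43.0),
  jun = c(69.3, 35.0, 2.0)
)
colnames(m) <- c("0700", "0930", "1200")

# Row-wise: each row is rescaled against its own minimum and maximum
row_scaled <- t(apply(m, 1, function(r) (r - min(r)) / (max(r) - min(r))))

# Table-wide: every value is rescaled against the overall minimum and maximum
all_scaled <- (m - min(m)) / (max(m) - min(m))

print(round(row_scaled, 2))
print(round(all_scaled, 2))
```

With row-wise scaling, every row spans the full 0 to 1 range, so each month uses the whole palette. With table-wide scaling, the jan values never reach the low end because the overall minimum sits in the jun row.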

One last thing, also to do with rows

The data_color() function now has a rows argument. Previously it wasn’t available, and you had no choice but to color each and every row in the columns specified. Of course, sometimes you just want colorization in a specific region of the table. Here’s an example that demonstrates this (and we’re using the new metro dataset):

metro |>
  dplyr::select(name, passengers, connect_other) |>
  dplyr::arrange(desc(passengers)) |>
  head(15) |>
  gt(locale = "fr") |>
  tab_header(
    title = "Les stations de métro les plus fréquentées et
    leurs nombre annuel de passagers",
    subtitle = "Ceux qui sont à côté des gares sont surlignés en vert"
  ) |>
  fmt_integer() |>
  tab_row_group(
    label = "a côté d'une gare",
    rows = grepl("TGV", connect_other),
    id = "gare"
  ) |>
  data_color(
    columns = passengers,
    rows = grepl("TGV", connect_other),
    method = "numeric",
    palette = c("lightgreen", "green" |> adjust_luminance(steps = -2))
  ) |>
  cols_hide(columns = connect_other) |>
  cols_label(
    name ~ "station de métro",
    passengers = "passagers"
  ) |>
  cols_width(
    name ~ px(375),
    passengers ~ px(150)
  ) |>
  tab_style(
    style = cell_text(align = "center"),
    locations = cells_row_groups(groups = "gare")
  ) |>
  opt_all_caps() |>
  opt_align_table_header(align = "left") |>
  opt_horizontal_padding(scale = 3) |>
  opt_table_font(stack = "rounded-sans")
Les stations de métro les plus fréquentées et leurs nombre annuel de passagers
Ceux qui sont à côté des gares sont surlignés en vert
station de métro passagers
a côté d'une gare
Gare du Nord 34 503 097
Saint-Lazare 33 128 384
Gare de Lyon 28 640 475
Montparnasse—Bienvenüe 20 407 224
Gare de l'Est 15 538 471
Bibliothèque François Mitterrand 11 104 474
République 11 079 708
Les Halles 10 623 876
La Défense 9 256 802
Châtelet 8 350 794
Bastille 8 069 243
Belleville 7 314 438
Hôtel de Ville 7 251 729
Place d'Italie 7 119 097
Bobigny—Pablo Picasso 6 561 327
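The rows argument in the example above is given a logical vector produced by grepl(). Here is what that kind of selection looks like on a few made-up connect_other values (these strings are hypothetical and not taken from the metro dataset):

```r
# Hypothetical connection strings, including a missing value
connect_other <- c("TGV, TER", NA, "RER", "TGV")

# TRUE marks the rows that would be colored; NA entries come back as FALSE
in_group <- grepl("TGV", connect_other)
print(in_group)  # TRUE FALSE FALSE TRUE
```

Because grepl() returns FALSE for missing values, rows without any connection information are simply left uncolored.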

Ce tableau de données là, c’est le fun! (This data table is fun!)

In conclusion

We’ve wanted to improve the data_color() function of gt for a few years now, and we are so glad that work is finally done in version 0.9.0! The new version of this function is far more powerful than before (and hopefully easier to use too).

This is blog post number three of a series on gt version 0.9.0. There’s more to come, owing to the fact that this release of gt is a big one. We always want your feedback, and there are many different ways to get in touch with us.

Until next time!

Rich Iannone

Software Engineer at Posit, PBC
Richard is a software engineer and table enthusiast. He and R go way back and he's been getting better at writing code in Python too. For the most part, Rich enjoys creating open source packages in R and Python so that people can do great things in their own work.