Open source packages - Quarto, Shiny, and more Commercial enterprise offerings

Merging columns together and fixing up column labels in {gt} `0.9.0`

Written by Rich Iannone
2023-06-14
MERGING COLUMNS AND FIXING COLUMN NAMES gt 0.9.0

The newest version of gt (0.9.0) has many refinements to what was already there. Sure, new functions were added, but we also took the time to fix up many existing functions and make them better. This (fifth) blog post about gt 0.9.0 is concerned with two major enhancements to existing functionality: (1) merging column data together and (2) modifying column labels to make them more suitable for presentation.

Column merging via cols_merge()

The cols_merge() function has been in the package since the first public release of gt. It’s useful! What it does is take input from two or more columns and allows the contents to be merged into a single column, using a pattern that specifies the arrangement.

Let’s try using this function in an example; only then will it all make some sense. With a portion of the metro dataset, we’ll format the numbers in the latitude and longitude columns, combine the data from both those columns into a single column (in this case, collapsing them into latitude column), and then clean up the column labels with cols_label().

library(gt)
Warning: package 'gt' was built under R version 4.4.3
metro_merge <-
  metro |>
  dplyr::filter(grepl("3rd", location)) |>
  dplyr::select(name, caption, latitude, longitude) |>
  dplyr::slice_head(n = 10) |>
  gt() |>
  fmt_number(
    columns = c(latitude, longitude),
    decimals = 4
  ) |>
  cols_merge(
    columns = c(latitude, longitude),
    pattern = "({1}, {2})"
  ) |>
  cols_label(
   name ~ "Station",
   latitude ~ "Coordinates"
  ) |>
  cols_hide(columns = caption)

metro_merge
Station Coordinates
Arts et Métiers (48.8655, 2.3561)
Réaumur—Sébastopol (48.8661, 2.3525)
République (48.8675, 2.3638)
Temple (48.8665, 2.3606)
Strasbourg—Saint-Denis (48.8697, 2.3539)
Chemin Vert (48.8568, 2.3683)
Filles du Calvaire (48.8642, 2.3661)
Saint-Sébastien—Froissart (48.8605, 2.3674)
Rambuteau (48.8612, 2.3534)

A table that had three columns now has two. What’s happening is that the formatted data from two columns is being combined in the latitude column (why that column? It’s the first column cited in the columns argument of cols_merge()), and the remaining column (longitude) undergoes the useful process of being hidden. The hide_columns argument of cols_merge() lets you modify this default behavior.

When merging column data, the thing gt will do is just put the values together, separated by a space character. Usually, this is not what you really want, so, a pattern specification can be used. In the above example, we used pattern = "({1}, {2})". Breaking this down, the "{1}" is the first column cited in columns, and the "{2}" is the second of the two. The rest of that string is made up of string literals. What we get in the end is the latitude and longitude values separated by a comma and a space, and all of that surrounded by parentheses.

This pattern syntax is great, so let’s look at a bit more of that in a continuation of the above example. Station names in the metro dataset sometimes have a caption value. To combine the caption with the station name in a readable format, we can use cols_merge() with the pattern "{1} ({2})" (where "{1}" corresponds to the name column, and "{2}" is the caption column). This is sensible, but as noted, not every station has an associated caption, so NAs unfortunately make their way into the table output:

metro_merge |>
  cols_unhide(columns = caption) |>
  cols_merge(
    columns = c(name, caption),
    pattern = "{1} ({2})"
  )
Station Coordinates
Arts et Métiers (NA) (48.8655, 2.3561)
Réaumur—Sébastopol (NA) (48.8661, 2.3525)
République (NA) (48.8675, 2.3638)
Temple (NA) (48.8665, 2.3606)
Strasbourg—Saint-Denis (NA) (48.8697, 2.3539)
Chemin Vert (NA) (48.8568, 2.3683)
Filles du Calvaire (NA) (48.8642, 2.3661)
Saint-Sébastien—Froissart (NA) (48.8605, 2.3674)
Rambuteau (Centre Georges Pompidou) (48.8612, 2.3534)

This is a problem, but gt 0.9.0 now has a solution for it. We can now use << >>⁠ to surround spans of text that will be removed if any of the contained {n} values yields a missing value. So if a particular station doesn’t have a caption value (i.e., it is NA), we will want to remove not only the missing value but also the parentheses we used in our pattern. The pattern should then change from "{1} ({2})" to "{1}<< ({2})>>". Let’s try it out!

metro_merge |>
  cols_unhide(columns = caption) |>
  cols_merge(
    columns = c(name, caption),
    pattern = "{1}<< ({2})>>"
  )
Station Coordinates
Arts et Métiers (48.8655, 2.3561)
Réaumur—Sébastopol (48.8661, 2.3525)
République (48.8675, 2.3638)
Temple (48.8665, 2.3606)
Strasbourg—Saint-Denis (48.8697, 2.3539)
Chemin Vert (48.8568, 2.3683)
Filles du Calvaire (48.8642, 2.3661)
Saint-Sébastien—Froissart (48.8605, 2.3674)
Rambuteau (Centre Georges Pompidou) (48.8612, 2.3534)

So much better! The cols_merge() function is now a bit more powerful thanks to the enhanced pattern syntax and functionality. Please check out the help article accessible through ?cols_merge for more details (particularly in the How the pattern works section). The syntax can even handle nested uses of << >> for more complex NA-handling situations.

Fixing up column labels with cols_label_with()

Column names do not often make for good column labels in tables for publication. Because of this, it’s very common to apply new labels to columns in a gt table with cols_label(). But sometimes you’re so close, you might want to replace underscores with spaces and put the separate words in title case. Using cols_label() to type out all the desired column labels can be such a drag. To make this situation better, gt 0.9.0 introduces the function called cols_label_with(). This new addition to the cols_*() family of functions lets you label columns using a function in its fn argument.

Let’s use sp500 to create a new gt table. We want all of the column labels to be entirely capitalized versions of the default labels, but instead of using cols_label() and rewriting each label manually in capital letters, we will use cols_label_with() and instruct it to apply the toupper() function to all column labels.

sp500 |>
  dplyr::filter(date >= "2015-12-01" & date <= "2015-12-11") |>
  dplyr::select(-c(adj_close, volume)) |>
  gt() |>
  cols_label_with(fn = toupper)
DATE OPEN HIGH LOW CLOSE
2015-12-11 2047.27 2047.27 2008.80 2012.37
2015-12-10 2047.93 2067.65 2045.67 2052.23
2015-12-09 2061.17 2080.33 2036.53 2047.62
2015-12-08 2073.39 2073.85 2052.32 2063.59
2015-12-07 2090.42 2090.42 2066.78 2077.07
2015-12-04 2051.24 2093.84 2051.24 2091.69
2015-12-03 2080.71 2085.00 2042.35 2049.62
2015-12-02 2101.71 2104.27 2077.11 2079.51
2015-12-01 2082.93 2103.37 2082.93 2102.63

Let’s have a look at another example, this time using a portion of the countrypops dataset. We can, if we prefer, apply multiple types of column label changes in sequence with multiple calls of cols_label_with(). Here, we use the make_clean_names() function from the janitor package and follow up with the removal of a numeral with gsub().

countrypops |>
  dplyr::filter(year == 2021) |>
  dplyr::filter(grepl("^C", country_code_3)) |>
  dplyr::select(-country_code_2, -year) |>
  head(8) |>
  gt() |>
  cols_move_to_start(columns = country_code_3) |>
  fmt_integer(columns = population) |>
  cols_label_with(
    fn = ~ janitor::make_clean_names(., case = "title")
  ) |>
  cols_label_with(
    fn = ~ gsub("[0-9]", "", .)
  )
Country Code Country Name Population
CAF Central African Republic 5,112,100
CAN Canada 38,239,864
CHE Switzerland 8,704,546
CHL Chile 19,456,334
CHN China 1,412,360,000
CIV Cote d'Ivoire 29,639,736
CMR Cameroon 26,915,758
COD Congo (DRC) 99,148,932

With the sza dataset, we can make a table where the column names are dates. The labels can be generated as month names in any language using cols_label_with() with vec_fmt_datetime() as the function given in fn.

sza |>
  dplyr::filter(latitude == 50) |>
  dplyr::mutate(date = paste0(
    "2000-",
    formatC(
      match(month, tolower(month.abb)),
      width = 2,
      flag = "0"
    ),
    "-01"
  )) |>
  dplyr::select(-c(latitude, month)) |>
  tidyr::pivot_wider(names_from = date, values_from = sza) |>
  gt(rowname_col = "tst") |>
  sub_missing(missing_text = "") |>
  cols_width(everything() ~ px(50)) |>
  data_color(
    palette = c("yellow", "darkblue"),
    domain = c(0, 90),
    na_color = "darkblue" |> adjust_luminance(steps = -1)
  ) |>
  tab_options(
    table_body.hlines.style = "none",
    table_body.border.bottom.color = "black",
    column_labels.border.top.color = "black",
    column_labels.border.bottom.color = "black",
    stub.border.color = "black"
  ) |>
  cols_label_with(fn = function(x) {
    vec_fmt_datetime(x, format = "MMM", locale = "de")
    }
  )
Jan. Feb. März Apr. Mai Juni Juli Aug. Sept. Okt. Nov. Dez.
0400





89.7




0430




86.7 85.7 89.7



0500



87.8 82.4 81.5 85.4



0530



83.2 78.0 77.1 80.8 88.3


0600


86.6 78.5 73.3 72.5 76.2 83.6


0630


81.8 73.7 68.6 67.8 71.4 78.8 87.6

0700

86.3 76.9 68.9 63.8 63.0 66.6 74.0 82.8

0730
89.5 81.8 72.2 64.1 59.0 58.2 61.8 69.2 78.2 87.1
0800
85.3 77.4 67.6 59.4 54.2 53.4 57.0 64.6 73.8 83.0 89.2
0830 86.5 81.5 73.3 63.2 54.7 49.5 48.7 52.4 60.1 69.6 79.0 85.5
0900 83.2 78.0 69.6 59.1 50.3 44.9 44.1 47.9 55.9 65.7 75.5 82.1
0930 80.2 74.9 66.2 55.4 46.2 40.6 39.7 43.7 52.0 62.1 72.3 79.1
1000 77.7 72.2 63.2 52.0 42.5 36.6 35.6 39.9 48.5 59.1 69.5 76.5
1030 75.7 70.0 60.9 49.3 39.4 33.1 32.1 36.6 45.7 56.5 67.3 74.5
1100 74.2 68.5 59.1 47.2 37.0 30.4 29.3 34.1 43.5 54.7 65.7 73.0
1130 73.3 67.5 58.0 45.9 35.5 28.6 27.5 32.5 42.1 53.5 64.7 72.1
1200 73.0 67.2 57.7 45.5 35.0 28.0 26.9 31.9 41.6 53.1 64.4 71.8

Here’s one more gt table example with cols_label_with(), and this time, we just have to use the venerable pizzaplace dataset. There are ways to use one instance of cols_label_with() with multiple functions called on the column labels. We’ll try that out here by providing an anonymous function (with the function(x) { ... } construction) that performs multiple mutations of x (the vector of column labels). In this example code, we’re even going to use the md() helper function to signal to gt that the column label should be interpreted as Markdown text.

pizzaplace |>
  dplyr::mutate(month = substr(date, 6, 7)) |>
  dplyr::group_by(month) |>
  dplyr::summarize(pizze_vendute = dplyr::n()) |>
  dplyr::ungroup() |>
  dplyr::mutate(frazione_della_quota = pizze_vendute / 4000) |>
  dplyr::mutate(date = paste0("2015/", month, "/01")) |>
  dplyr::select(-month) |>
  gt(rowname_col = "date") |>
  fmt_date(date, date_style = "month", locale = "it") |>
  fmt_percent(columns = frazione_della_quota) |>
  fmt_integer(columns = pizze_vendute) |>
  cols_width(everything() ~ px(100)) |>
  cols_label_with(
    fn = function(x) {
      janitor::make_clean_names(x, case = "title") |>
        toupper() |>
        stringr::str_replace_all("^|$", "**") |>
        md()
    }
  )
PIZZE VENDUTE FRAZIONE DELLA QUOTA
gennaio 4,232 105.80%
febbraio 3,961 99.02%
marzo 4,261 106.53%
aprile 4,151 103.77%
maggio 4,328 108.20%
giugno 4,107 102.68%
luglio 4,392 109.80%
agosto 4,168 104.20%
settembre 3,890 97.25%
ottobre 3,883 97.08%
novembre 4,266 106.65%
dicembre 3,935 98.38%

Bellissimo! We hope that the cols_label_with() is something you can use to avoid unnecessary repetition in your table-making code! Much more information can be found in the detailed help file for this function (use ?cols_label_with to access).

An improved cols_label() function

Aside from the new cols_label_with() function, we also took the time to make some targeted improvements to cols_label(). If you never really used it, what it does is allow for column labels to be modified. The basic model is that you provide the column names (i.e., those from the input data table) on the left and the column labels on the right.

A new thing that was added is the inclusion of a .fn argument. This gives you the opportunity to supply a function that is to be called on each label provided. For example, you can use the md() helper function (which signals to gt that the labels are given as Markdown text) in .fn once instead of calling it on each and every label.

metro |>
  dplyr::select(name, lines, passengers, connect_other) |>
  dplyr::arrange(desc(passengers)) |>
  dplyr::slice_head(n = 10) |>
  gt() |>
  cols_hide(columns = passengers) |>
  sub_missing(missing_text = "") |>
  cols_label(
    name = "Name of<br>Metro Station",
    lines = "Metro<br>Lines",
    connect_other = "Train<br>Services",
    .fn = md
  )
Name of
Metro Station
Metro
Lines
Train
Services
Gare du Nord 4, 5 TGV, TER, Thalys, Eurostar
Saint-Lazare 3, 12, 13, 14 TGV, TER, Intercités
Gare de Lyon 1, 14 TGV, TGV Lyria, Renfe-SNCF, OUIGO, Frecciarossa
Montparnasse—Bienvenüe 4, 6, 12, 13 TGV, TER, Intercités, OUIGO
Gare de l'Est 4, 5, 7 TGV, TER, OUIGO, Nightjet
Bibliothèque François Mitterrand 14
République 3, 5, 8, 9, 11
Les Halles 4
La Défense 1
Châtelet 1, 4, 7, 11, 14

We have a new way of expressing column labels in cols_label(), and that is through the use of two-sided formulas (⁠<LHS> ~ <RHS> instead of ⁠<LHS> = <RHS>). By using tidyselect expressions on the left side of the formulas, we can easily relabel multiple columns with common label text. Let’s create an interesting gt table that exemplifies this (using the towny dataset):

towny |>
  dplyr::select(
    name, ends_with("2001"), ends_with("2006"), matches("2001_2006")
  ) |>
  dplyr::filter(population_2001 > 100000) |>
  dplyr::arrange(desc(pop_change_2001_2006_pct)) |>
  dplyr::slice_head(n = 10) |>
  gt() |>
  fmt_integer() |>
  fmt_percent(columns = matches("change"), decimals = 1) |>
  tab_spanner(label = "Population", columns = starts_with("population")) |>
  tab_spanner(label = "Density", columns = starts_with("density")) |>
  cols_label(
    ends_with("01") ~ "2001",
    ends_with("06") ~ "2006",
    matches("change") ~ md("Population Change,<br>2001 to 2006")
  ) |>
  cols_width(everything() ~ px(120))
name
Population
Density
Population Change,
2001 to 2006
2001 2006 2001 2006
Brampton 325,428 433,806 1,224 1,632 33.3%
Vaughan 182,022 238,866 668 877 31.2%
Markham 208,615 261,573 989 1,240 25.4%
Barrie 103,710 128,430 1,047 1,297 23.8%
Richmond Hill 132,030 162,704 1,310 1,614 23.2%
Oakville 144,738 165,613 1,042 1,192 14.4%
Mississauga 612,925 668,599 2,094 2,284 9.1%
Cambridge 110,372 120,371 977 1,065 9.1%
Burlington 150,836 164,415 810 883 9.0%
Guelph 106,170 114,943 1,214 1,315 8.3%

Note that we cannot use an = sign in any of the expressions within cols_label() in the above example. Because the LHS is not a single-column name, we must use formula syntax (i.e., with the ~). We are not limited to just using tidyselect expressions on the LHS. We can alternatively supply vectors of column names (and this is great for using repetitive labels across many columns).

Signing off

We want to make things better in gt (always). If you identify points of friction that impede your own gt workflows, let us know! You can file an issue on GitHub. Don’t be shy about sharing ideas or asking questions. We are there for you on the gt Discussions page.

We occasionally share things on Twitter at @gt_package, so you should follow us there. Our gt_package Discord server is active and also lots of fun (drop by and show us your table creations in there!).

Rich Iannone

Software Engineer at Posit, PBC
Richard is a software engineer and table enthusiast. He and R go way back and he's been getting better at writing code in Python too. For the most part, Rich enjoys creating open source packages in R and Python so that people can do great things in their own work.