library(dplyr)
library(forcats)
library(tidyr)
# Transform and filter the `movies_na` data set
movies_na_long <- movies_na %>%
  # Pivot genre columns (from Action to Short) into long format:
  # Creates two columns: "genre" and "value"
  pivot_longer(Action:Short, names_to = "genre") %>%
  # Filter to keep only:
  # - rows where the movie is flagged in that genre (value != 0)
  # - only the genres Comedy and Drama
  # - movies released between 1900 and 2000
  filter(
    value != 0,
    genre %in% c("Comedy", "Drama"),
    between(year, 1900, 2000)
  ) %>%
  # Convert genre to a factor
  mutate(genre = as_factor(genre)) %>%
  # Count the number of movies per genre per year
  count(genre, year)
3 Geoms
Everything is liable to change, even the title
A geom (short for geometric object) is a component that defines how data is visually represented in a plot. Geoms determine the type of visualization or the graphical shape that will be drawn.
“These geoms are the fundamental building blocks of ggplot2 (Wickham et al. 2025). They are useful in their own right, but are also used to construct more complex geoms. Most of these geoms are associated with a named [chart]: when that geom is used by itself in a [chart], that [chart] has a special name.” (Wickham 2016)
ggplot2 already has a long list of geoms. We won’t be discussing those unless there is an extension package that is an improvement to the original. Primarily, this chapter focuses on the geoms that ggplot2 does not include.
3.1 Area charts
Area charts are based on line charts. The area between the x-axis and each line (or the area between lines) is shaded to help highlight the volume of the data.
In this section, we’ll take a look at the horizon chart, an improved version of the ribbon chart, and the streamgraph. They are all different takes on the area chart.
3.1.1 Horizon chart
A horizon chart is a method for condensing time series data into a format that is both informative and relatively easy to interpret.
Often, when you have both positive and negative values, they lie on both sides of the x-axis. In a horizon chart, the negative values are on the same side as the positive ones.
We use color to show whether the values are positive or negative, and also to show the magnitude of those values.
As Jonathan Schwabish points out in their book, Better Data Visualizations (Schwabish 2021), “the purpose of the horizon chart is not necessarily to enable readers to pick out specific values, but instead to easily spot general trends and identify extreme values”.
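To make the folding idea concrete, here is a minimal base R sketch. It is not ggHoriPlot’s actual implementation, only an illustration of the principle: each value is measured from an origin, assigned to a band of fixed width, and reduced to its height within that band. The toy values, `origin`, and `band_width` are made up for this example.

```r
# Toy values around an origin of 0 (all numbers are illustrative)
values <- c(-7, -2, 1, 4, 9)
origin <- 0
band_width <- 5

# Distance from the origin; the sign tells us which side we are on
distance <- values - origin
side <- ifelse(distance < 0, "below", "above")

# Band index (1 = closest to the origin); higher bands get stronger colors
band <- floor(abs(distance) / band_width) + 1

# Height within the band: both sides are drawn upward from the same baseline
height_in_band <- abs(distance) %% band_width

data.frame(values, side, band, height_in_band)
```

The `side` column would drive the hue (for example, blue versus red) and the `band` column the intensity, which is how one narrow strip can carry both the sign and the magnitude.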
3.1.1.1 Viz #1: Helsinki temperatures, part I
For the horizon chart, we’ll be using ggHoriPlot (Rivas-González 2022). The package includes various example data sets, but we’ll be using weather data from the Finnish Meteorological Institute (FMI). Its open weather observation data is licensed under CC BY 4.0.
Using the FMI API (Application Programming Interface), I retrieved the average temperatures in Helsinki (Kaisaniemi weather station) for the years 2000-2024. You can take a look at the data below.
We have avg_temperature_celsius (the daily average temperature in Celsius), day, month, and year. We also have the date_dummy column. It is there because we want to use the month as the x-axis, and the column needs to be in date format for our use case. So all the rows need to share the same dummy year while keeping their real months and days. I chose 2024 because it was a leap year. With a non-leap dummy year, all the rows with February 29th would have NA in that column instead of a valid date.
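The leap-year point is easy to check in base R. The helper below, `make_dummy_date()`, is a hypothetical function written only for this illustration; it is not part of the data preparation shown in this chapter.

```r
# Build a dummy date from a real month and day plus a fixed dummy year
make_dummy_date <- function(month, day, dummy_year) {
  as.Date(
    sprintf(
      "%d-%02d-%02d",
      as.integer(dummy_year), as.integer(month), as.integer(day)
    ),
    format = "%Y-%m-%d"
  )
}

# February 29th does not exist in 2023, so the result is NA
feb29_2023 <- make_dummy_date(2, 29, dummy_year = 2023)

# 2024 is a leap year, so the same month and day give a valid date
feb29_2024 <- make_dummy_date(2, 29, dummy_year = 2024)

feb29_2023
feb29_2024
```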
Before we can proceed with the visualization, we need to perform some data wrangling. First, we’ll remove outliers using the interquartile range (IQR) method.
library(dplyr)
# Filter temperature data to exclude outliers based on 1.5 * IQR method
cutpoints <- temperature_hki %>%
  filter(
    between(
      avg_temperature_celsius,
      quantile(
        avg_temperature_celsius, 0.25, na.rm = TRUE
      ) - 1.5 * IQR(avg_temperature_celsius, na.rm = TRUE),
      quantile(
        avg_temperature_celsius, 0.75, na.rm = TRUE
      ) + 1.5 * IQR(avg_temperature_celsius, na.rm = TRUE)
    )
  )
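If the IQR method is new to you, the same fence logic can be traced on a toy vector in base R. The numbers below are made up and have nothing to do with the temperature data.

```r
# Toy data with one obvious outlier (illustrative values)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# First and third quartiles, and the interquartile range
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- IQR(x)

# Fences: anything outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] is an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Keep only the values inside the fences (inclusive, like dplyr::between())
kept <- x[x >= lower_fence & x <= upper_fence]

c(lower_fence = lower_fence, upper_fence = upper_fence)
kept
```

Here the fences come out at -3.5 and 14.5, so only the value 100 is dropped.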
Fifteen outliers were filtered out, and we can continue. Next, we’ll calculate the midpoint of the temperature range and also divide the scale into evenly spaced value ranges. We’ll use the first as the origin for the horizon chart and the second to determine how to color the areas.
# Calculate the midpoint of the temperature range for use in horizon chart
origin <- cutpoints %>%
  summarize(origin = mean(range(avg_temperature_celsius))) %>%
  pull(origin)
# Create the scale vector:
# 7 evenly spaced values across the filtered temperature range.
# Drop the 4th value (the midpoint), as required by ggHoriPlot scale input
scale <- cutpoints %>%
  summarize(
    min_val = min(avg_temperature_celsius),
    max_val = max(avg_temperature_celsius)
  ) %>%
  # Generate 7 evenly spaced values
  with(seq(min_val, max_val, length.out = 7)) %>%
  # Convert to tibble to use dplyr::slice()
  tibble() %>%
  # Remove the middle value (4th out of 7)
  slice(-4) %>%
  # Return as plain numeric vector
  pull(.)
The origin is 3.55, and the scale cutpoints are as follows: -19.3, -11.68, -4.07, 11.17, 18.78, 26.4.
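As a quick sanity check, the same numbers can be reproduced in base R. This sketch assumes that the filtered minimum and maximum are exactly the first and last cutpoints listed above (-19.3 and 26.4); `temp_min`, `temp_max`, and the `_check` names are introduced only for this illustration.

```r
# Assumed range of the filtered data (taken from the cutpoints listed above)
temp_min <- -19.3
temp_max <- 26.4

# The origin is the midpoint of the range
origin_check <- mean(c(temp_min, temp_max))

# Seven evenly spaced values, with the 4th (the midpoint) dropped
scale_check <- seq(temp_min, temp_max, length.out = 7)[-4]

round(origin_check, 2)
round(scale_check, 2)
```

Note that the dropped fourth value is the origin itself, which is why it does not need to appear in the scale vector.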
Now we’re ready for the visualization itself. Besides ggHoriPlot and ggplot2, we’ll be using ggthemes (Arnold 2024) for the theme. We’ll dive deeper into themes, including ggthemes (Section 9.1.6), later on in Chapter 9.
We’re using geom_horizon() to create the horizon chart. The arguments to pay attention to are fill (inside aes()), origin, and horizonscale. They are all using the origin and scale we calculated earlier. scale_fill_hcl() is also available in the ggHoriPlot package. Otherwise, we’re using basic ggplot2 functionalities.
library(ggHoriPlot) # for geom_horizon() to create horizon plots
library(ggplot2) # for general plotting functions
library(ggthemes) # for additional themes like theme_few()
# Create the horizon chart
ggplot(temperature_hki) +
# Horizon chart layer, mapping x, y, and fill aesthetics
geom_horizon(
aes(
date_dummy, # x-axis: typically date
avg_temperature_celsius, # y-axis: temperature variable
fill = after_stat(Cutpoints) # fill determined by horizon chart cutpoints
),
origin = origin, # baseline (e.g., 0°C); defines neutral midpoint
horizonscale = scale # vertical scale; controls how bands are split
) +
# Use a diverging color scale (red-blue), reversed so red = high temp
scale_fill_hcl(palette = 'RdBu', reverse = TRUE) +
# Create one small horizon chart per year, stacked vertically
facet_grid(vars(year)) +
# Use a clean, simple theme based on the rules and examples from Stephen
# Few's Show Me the Numbers and Practical Rules for Using Color in Charts
theme_few() +
# Customize appearance
theme(
# Customize x-axis labels
axis.text.x = element_text(size = 10),
# Remove unnecessary labels
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
panel.border = element_blank(),
# Remove vertical space between facets
panel.spacing.y = unit(0, "lines"),
# Subtle caption style
plot.caption = element_text(size = 8, hjust = 0, color = "#777777"),
# Adjust margins because otherwise Jan is too close to the left edge
plot.margin = margin(10, 10, 10, 15),
# Customize facet labels
strip.text.y = element_text(size = 8, angle = 0, hjust = 0)
) +
# Format x-axis to show months (with short names) without expansion
scale_x_date(
expand = c(0, 0),
date_breaks = "1 month",
date_labels = "%b"
) +
# Add informative plot title, subtitle, and data source caption
labs(
title = "Average daily temperature (Celsius) in Helsinki",
subtitle = "From 2000 to 2024",
caption = "Data: Finnish Meteorological Institute open data, weather observations (CC BY 4.0) | Visualization: Antti Rask",
x = NULL # remove x-axis title
)
I’m not a climatologist, but it does seem like there is a trend, over time, of Helsinki having milder winters. The summer temperatures are less clear-cut and will need a closer look.
3.1.2 Ribbon chart (improved)
As the ggplot2 documentation tells us, an area chart is, in fact, a special case of a ribbon chart. That makes sense when you realize that every type of area chart has a ymin and ymax. In the basic area chart, ymin is zero, and ymax is y. (Wickham et al. 2025)
The ribbon chart, then, displays the area between two lines. geom_ribbon() gets the job done when the lines don’t meet. But as you’ll soon see, for those cases where they do, you need something called braiding. You can read more about it in the ggbraid (Grantham 2025) documentation.
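To see why crossings are a problem, note that a ribbon is only defined at the observed x values, while two lines usually swap places somewhere between those values. The base R sketch below illustrates the underlying idea by finding the in-between crossing points with linear interpolation. It is not ggbraid’s actual code, and the toy vectors `x`, `a`, and `b` are made up.

```r
# Two toy series observed at the same x values (illustrative numbers)
x <- c(1, 2, 3, 4)
a <- c(1, 3, 4, 1)
b <- c(2, 2, 2, 2)

# Difference between the series; a sign change means the lines crossed
d <- a - b
n <- length(d)
idx <- which(d[-n] * d[-1] < 0)

# Linear interpolation: fraction of the interval at which d reaches zero
frac <- d[idx] / (d[idx] - d[idx + 1])
x_cross <- x[idx] + frac * (x[idx + 1] - x[idx])
y_cross <- a[idx] + frac * (a[idx + 1] - a[idx])

data.frame(x_cross, y_cross)
```

With such points added to the data, each filled region can end exactly where the lines meet instead of at the nearest observed x value.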
3.1.2.1 Viz #2: IMDB movies, Part I
We’ll take a look at what the problem (and solution) looks like with real data. The question we’d like to answer here is which genre, comedy or drama, has produced more movies during the 20th century.
But let’s first examine the data we’ll be using for this. Time series data works best for this type of chart. Let’s stick to the ggplot2movies (Wickham 2015) data set that we first encountered in Section 2.1.2 and the movies_na tibble.
We’ll perform some transformations to prepare the data for visualization. In this case, we’ll need the data in two formats, long and wide.
Let’s first create the long tibble consisting of genre, year, and n (for the count).
One minor detail to note here is that we’ll turn the genre column from character to factor format. We’ll use forcats (Wickham 2023a) from the tidyverse (Wickham 2023c) to do that. It’s not necessary in this case, but it’s good practice.
Next, we’ll split the genre into two separate columns, Comedy and Drama. They will retrieve their values from the n column.
We’ll also be adding a fill_condition column. We’ll use that later to determine which color to use to fill the area between the two lines.
# Convert long-format genre counts back to wide format and compare values
movies_na_wide <- movies_na_long %>%
  # Pivot genre counts from long to wide:
  # Each genre becomes its own column (e.g., Comedy, Drama),
  # with yearly counts as values
  pivot_wider(
    names_from = genre,
    values_from = n
  ) %>%
  # Create a new logical column indicating whether
  # Comedy had fewer movies than Drama in that year
  mutate(fill_condition = Comedy < Drama)
Now we can return to the topic of why we need to use an extension for cases where the lines don’t stay separate.
Here’s what the basic visualization would look like with geom_ribbon() from ggplot2.
ggplot() +
  geom_line(
    aes(year, n, linetype = genre),
    data = movies_na_long
  ) +
  geom_ribbon(
    aes(year, ymin = Comedy, ymax = Drama, fill = fill_condition),
    data = movies_na_wide,
    alpha = 0.7
  )
That won’t work if we want to use the ribbon chart to show where the two categories change places, indicating which is greater.
But that’s where ggbraid’s geom_braid() comes to the rescue. The basic code is the same. We’ll only switch the geom function.
The rest of the code is to make the visualization more presentable. Note that we’ll move the legend inside the plot. We’ll use the legend.position argument inside the theme() function to do that. There’s enough white space inside the plot to accommodate the legend. This way, we gain more space to showcase the time series element of the plot.
library(ggbraid) # for geom_braid() to visualize overlapping time series
library(ggplot2) # for general plotting functions
ggplot() +
# Line plot for number of movies per genre per year
geom_line(
aes(year, n, linetype = genre),
data = movies_na_long
) +
# Braid layer to highlight which genre had more movies per year
geom_braid(
aes(year, ymin = Comedy, ymax = Drama, fill = fill_condition),
data = movies_na_wide,
alpha = 0.7
) +
# Text annotation when comedies dominated
annotate(
"text",
x = 1938,
y = 300,
size = 4,
label = "More comedies than drama",
hjust = 0.5,
color = "#F36523"
) +
# Text annotation when dramas dominated
annotate(
"text",
x = 1975,
y = 80,
size = 4,
label = "More drama than comedies",
hjust = 0.5,
color = "#125184"
) +
# Manually set fill colors for the braid based on fill condition
scale_fill_manual(values = c("#F36523", "#125184")) +
# Customize x-axis: limit, spacing, and ticks
scale_x_continuous(
expand = c(0, 1),
limits = c(1899, 2001),
breaks = seq(1900, 2000, by = 10)
) +
# Customize y-axis: limit, spacing, and ticks
scale_y_continuous(
expand = c(0, 1),
limits = c(0, 800),
breaks = seq(0, 800, by = 100)
) +
# Hide fill legend (keep linetype legend only)
guides(fill = "none") +
# Add plot title, subtitle, and axis labels
labs(
linetype = NULL,
title = "100 years of cinema",
subtitle = "Number of comedies vs. dramas throughout the 20th century",
caption = "Data: IMDb movies (1893-2005) via {ggplot2movies} | Visualization: Antti Rask",
x = NULL,
y = NULL
) +
# Use a clean black-and-white theme as a base
theme_bw() +
# Custom legend appearance and positioning
theme(
legend.direction = "horizontal",
legend.box.background = element_rect(
color = "black",
linetype = "solid",
linewidth = 0.5
),
legend.key.size = unit(2, "line"),
legend.position = "inside", # place the legend inside the plot (ggplot2 >= 3.5.0)
legend.position.inside = c(0.19, 0.88), # relative position inside plot
legend.text = element_text(size = 10)
)
I’m not an expert on this topic either. Based on this data set, however, there appears to be a correlation between major wars (WWI, WWII, and the Vietnam War) and the production of more comedies than dramas.
3.1.2.2 Viz #3: Helsinki temperatures, part II
Let’s take a look at another ribbon chart. We can use the data set from Section 3.1.1, which contains average daily temperatures in Helsinki from 2000 to 2024. We’ll compare two years, 2000 and 2024. Which one had more days that were warmer than the same day in the other year?
First, we’ll perform similar transformations as before and convert the data into both long and wide formats.
One minor detail to note here is that we’ll need to turn the year column from numeric to factor format. It won’t work as a category for the linetype argument otherwise. We’ll again use the forcats package to do that.
temperature_hki_long <- temperature_hki %>%
  # Filter data to include only the years 2000 and 2024
  filter(year %in% c(2000, 2024)) %>%
  # Convert `year` to a factor (useful for plotting or grouping)
  mutate(year = as_factor(year)) %>%
  # Keep only the necessary columns for analysis or visualization
  select(avg_temperature_celsius, year, date_dummy)
In this next transformation, note the use of the names_prefix argument. A column with a number as the first character of the name is not ideal. This will take care of that.
# Pivot data from long to wide format
temperature_hki_wide <- temperature_hki_long %>%
  # Creates one column per year (e.g., year_2000, year_2024),
  # using temperature values as the content
  pivot_wider(
    names_from = year,
    names_prefix = "year_",
    values_from = avg_temperature_celsius
  ) %>%
  # Create a new logical column to compare the two years:
  # TRUE if 2024 temp > 2000 temp for that date
  mutate(fill_condition = year_2000 < year_2024)
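To see what names_prefix changes, here is a small self-contained sketch with made-up temperatures. `toy_long` and its values are hypothetical and only mimic the shape of the real data.

```r
library(tidyr)

# Toy long-format data: two dates, two years (temperatures are made up)
toy_long <- data.frame(
  date_dummy = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02")),
  year = c("2000", "2024", "2000", "2024"),
  avg_temperature_celsius = c(-3.1, 0.4, -5.0, -6.2)
)

# Without a prefix, the new columns are named `2000` and `2024`,
# which then need backticks whenever we refer to them
toy_wide_plain <- pivot_wider(
  toy_long,
  names_from = year,
  values_from = avg_temperature_celsius
)
names(toy_wide_plain)

# With names_prefix, we get syntactic names: year_2000 and year_2024
toy_wide <- pivot_wider(
  toy_long,
  names_from = year,
  names_prefix = "year_",
  values_from = avg_temperature_celsius
)

# The comparison column can now be written without backticks
toy_wide$fill_condition <- toy_wide$year_2000 < toy_wide$year_2024
toy_wide
```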
We’ll also count the number (and share of the total) of days on which the average temperature was higher in 2024 than in 2000, and vice versa. We’ll use this information for annotations.
temperature_hki_wide %>%
  # Count how many days had each condition (TRUE/FALSE)
  count(fill_condition) %>%
  # Calculate the share of the total for each group
  mutate(n_pct = round(n / sum(n), 3))
# A tibble: 2 × 3
  fill_condition     n n_pct
  <lgl>          <int> <dbl>
1 FALSE            145 0.396
2 TRUE             221 0.604
Looks like 2024 has more days (60.4%) that were, on average, warmer than 2000 (39.6%).
The visualization itself is like the movie example. The most significant difference is the use of two packages from the tidyverse family.
str_glue() from stringr (Wickham 2023b) features a convenient implicit line break functionality. We’ll also use it to add the degree Celsius symbol (°C) to the y-axis.
as_date() from lubridate (Spinu, Grolemund, and Wickham 2024) allows us to use the date in character format to map it to the x-axis. This helps us place the annotations in the correct position.
library(ggbraid) # for geom_braid(), visualizing area between two lines
library(ggplot2) # for general plotting functions
library(lubridate) # for working with date types
library(stringr) # for string manipulation like str_glue()
ggplot() +
# Add temperature lines for each year (2000 and 2024)
geom_line(
aes(date_dummy, avg_temperature_celsius, linetype = year),
data = temperature_hki_long
) +
# Add braided area showing difference between 2000 and 2024
# Fill based on which year was warmer (fill_condition)
geom_braid(
aes(
date_dummy,
ymin = year_2000,
ymax = year_2024,
fill = fill_condition
),
data = temperature_hki_wide,
alpha = 0.7
) +
# Annotate area where 2000 was warmer
annotate(
"text",
x = as_date("2024-03-01"),
y = -17.5,
size = 4,
label = str_glue(
"2000 was warmer
40 % of the days"
),
hjust = 0.5,
color = "#125184"
) +
# Annotate area where 2024 was warmer
annotate(
"text",
x = as_date("2024-11-15"),
y = 17.5,
size = 4,
label = str_glue(
"2024 was warmer
60 % of the days"
),
hjust = 0.5,
color = "#F36523"
) +
# Manual fill colors: blue for 2000 warmer, orange for 2024 warmer
scale_fill_manual(values = c("#125184", "#F36523")) +
# Format x-axis: monthly ticks, short month labels
scale_x_date(
date_breaks = "1 month",
date_labels = "%b",
expand = c(0, 0.1)
) +
# Format y-axis: show temperature with °C symbol
scale_y_continuous(labels = ~ str_glue("{.x} °C")) +
# Hide fill legend (keep linetype legend only)
guides(fill = "none") +
# Add plot title, subtitle, and caption
labs(
linetype = NULL,
title = "Is the temperature rising?",
subtitle = "Average daily temperatures (Celsius) in Helsinki, 2000 vs. 2024",
caption = "Data: Finnish Meteorological Institute open data, weather observations (CC BY 4.0) | Visualization: Antti Rask",
x = NULL,
y = NULL
) +
# Use a clean, minimal black-and-white theme
theme_bw() +
# Customize legend and caption styling
theme(
legend.direction = "horizontal",
legend.box.background = element_rect(
color = "black",
linetype = "solid",
linewidth = 0.5
),
legend.key.size = unit(2, "line"),
legend.position = "inside", # place the legend inside the plot (ggplot2 >= 3.5.0)
legend.position.inside = c(0.83, 0.12), # bottom-right position
legend.text = element_text(size = 10),
plot.caption = element_text(size = 8, hjust = 1, color = "#777777")
)
And so we have another perspective on the Helsinki temperature data set.
3.1.3 Streamgraph
A streamgraph is a stacked area chart in which the areas are stacked around a central axis instead of on top of a fixed baseline.
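The centering can be sketched in a few lines of base R. This only illustrates the general idea of a symmetric baseline and is not how ggstream computes its shapes (it also smooths the curves, for one). The `layers` matrix holds made-up counts.

```r
# Toy counts: one row per year, one column per category (made-up numbers)
layers <- rbind(
  c(1, 2, 1),
  c(2, 4, 2),
  c(3, 6, 5)
)
colnames(layers) <- c("Depends", "Imports", "Suggests")

# In a regular stacked area chart the baseline is 0.
# Here we shift it down by half of the total, so the stack is centered on 0.
total <- rowSums(layers)
baseline <- -total / 2

# Upper and lower edge of each layer after stacking on the shifted baseline
upper <- baseline + t(apply(layers, 1, cumsum))
lower <- upper - layers

lower
upper
```

The bottom edge of the first layer sits at minus half the total and the top edge of the last layer at plus half the total, which is what gives the chart its symmetric, stream-like shape.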
3.1.3.1 Viz #4: ggplot2 dependencies
If you’ve read the book from the beginning, you’ve already encountered a streamgraph in Section 1. We will use that visualization as the example in this section.
The purpose of this visualization is to illustrate how the ggplot2 dependencies have evolved from a small speck in 2007 to their current state at the end of 2024. This is the exact use case for which I would use a streamgraph.
Before we proceed, I want to mention Georgios Karamanis. Their original visualization for Tidy Tuesday (see Figure 3.1) was the inspiration for my version.
If you want to know the differences, they are as follows:
- Use the initial release years instead of the latest release…
- …which meant switching the data source
- Use the ggplot2-related packages’ metadata
- Bring in a third type, Imports
- Change the color palette
- Change the fonts to Roboto Mono
- Annotate all the major ggplot2 releases
- Make the stream chart less wavy
- Other, smaller changes
We’ll cover these in more detail as we proceed.
Let’s take a look at the data we’re working with. It was gathered with pkgsearch (Csárdi and Salmon 2025).
We won’t reiterate the definitions of the different types (see Section 1 for them). But we can see that the count for each type starts small in 2007 and increases significantly by 2024.
Let’s first take a look at what the streamgraph would look like with default settings. We’ll use geom_stream() from ggstream (Sjoberg 2021) for that.
library(ggstream)
ggplot2_dependencies_by_year %>%
  ggplot() +
  geom_stream(
    aes(
      x = year,
      y = n,
      fill = type
    )
  )
That’s not bad for ten lines of code. But we can make it more presentable.
Let’s start with colors. In the previous visualizations, we’ve inserted the hex codes straight into the code. But since we need to use these colors in many places, let’s convert them into a vector so we don’t have to repeat ourselves.
color_1 <- "#F36523"
color_2 <- "#125184"
color_3 <- "#2E8B57"
colors_viz_4 <- c(color_1, color_2, color_3)
colors_viz_4
[1] "#F36523" "#125184" "#2E8B57"
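One optional variation, sketched here as a suggestion rather than as part of the original code: if the vector is named after the dependency types, the link between each type and its color becomes explicit and no longer depends on the order of the factor levels. The name `colors_named` is introduced only for this example.

```r
color_1 <- "#F36523"
color_2 <- "#125184"
color_3 <- "#2E8B57"

# Name each color after the dependency type it belongs to
colors_named <- c(
  Depends = color_1,
  Imports = color_2,
  Suggests = color_3
)

# Look up a color by type
colors_named[["Imports"]]

# In a plot, scale_fill_manual(values = colors_named) would then match
# the colors to the levels of `type` by name instead of by position
```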
For the font, I wanted to use something different. Roboto is a slightly futuristic font (family) that I like. But it doesn’t come with R.
That’s why we’ll use showtext (Qiu and Raggett 2024) (read more about it in Section 9.6.2).
library(showtext)
font_add_google("Roboto Mono", "roboto")
showtext_auto()
font_family <- "roboto"
Next, we’ll create the annotations that’ll be displayed on the right side of the graph.
The labels are in an HTML-style format, to be used later with ggtext (Wilke and Wiernik 2022). We’ll delve into this topic in more detail in Section 9.4.1.
annotation_numbers <- ggplot2_dependencies_by_year %>%
  # Total number of packages per dependency type
  summarize(
    n = sum(n),
    .by = type
  ) %>%
  # Ensure consistent vertical stacking order
  arrange(type) %>%
  mutate(
    # Manually assign y-positions for labels
    y = c(390, 75, -300),
    # Create HTML-styled rich labels with colored numbers
    label = case_when(
      type == "Depends" ~
        str_glue("**<span style='color:{color_1}'>{n}</span>**"),
      type == "Imports" ~
        str_glue("**<span style='color:{color_2}'>{n}</span>**"),
      type == "Suggests" ~
        str_glue("**<span style='color:{color_3}'>{n}</span>**")
    )
  )
Now it’s time to bring it all together.
Usually, you would start with the main geom for the plot. But we must begin with the data points and labels due to the layer order. We’re stacking layers on top of each other, and here we want the lines for the labels to stay behind the streamgraph.
The new packages we haven’t mentioned before are colorspace (Ihaka et al. 2024) and ggrepel (Slowikowski 2024).
We’ll get back to them both in detail later, but colorspace (see more in Section 16.2.2.1) is used for color manipulation. In this graph, we’re making the borders of the areas slightly darker than the areas themselves. ggrepel (see more in Section 5.9) is used for creating labels that don’t overlap.
library(colorspace) # for color manipulation like darken()
library(ggplot2) # for general plotting functions
library(ggrepel) # for placing non-overlapping labels
library(ggstream) # for creating streamgraphs
library(ggtext) # for text manipulation
library(stringr) # for string manipulation like str_glue()
ggplot(ggplot2_dependencies_by_year) +
# Add a small point at the origin to anchor the first version label
geom_point(
aes(x = 2007, y = 0),
data = NULL,
size = 1.5,
stat = "unique"
) +
# Add a label for ggplot2 version 0.5 (2007)
geom_label_repel(
aes(x = 2007, y = 0, label = "{ggplot2}\nver 0.5"),
data = NULL,
stat = "unique",
nudge_y = 75,
label.size = NA,
lineheight = 0.5,
family = font_family
) +
# Add a label for ggplot2 version 1.0 (2014)
geom_label_repel(
aes(x = 2014, y = 50, label = "{ggplot2}\nver 1.0"),
data = NULL,
stat = "unique",
nudge_y = 115,
label.size = NA,
lineheight = 0.5,
family = font_family
) +
# Add a label for ggplot2 version 2.0 (2015)
geom_label_repel(
aes(x = 2015, y = 125, label = "{ggplot2}\nver 2.0"),
data = NULL,
stat = "unique",
nudge_y = 140,
label.size = NA,
lineheight = 0.5,
family = font_family
) +
# Add a label for ggplot2 version 3.0 (2018)
geom_label_repel(
aes(x = 2018, y = 100, label = "{ggplot2}\nver 3.0"),
data = NULL,
stat = "unique",
nudge_y = 200,
label.size = NA,
lineheight = 0.5,
family = font_family
) +
# Add a label for ggplot2 version 3.5 (2024)
geom_label_repel(
aes(x = 2024, y = 410, label = "{ggplot2}\nver 3.5"),
data = NULL,
stat = "unique",
nudge_y = 75,
label.size = NA,
lineheight = 0.5,
family = font_family
) +
# Create a stream-like area chart showing the number of dependent
# packages by type and year
geom_stream(
aes(
x = year,
y = n,
fill = type,
# Slightly darken stream borders for contrast
color = after_scale(darken(fill))
),
# bw means bandwidth of kernel density estimation
# This is the argument you can control the waviness with
# The closer the value is to 1, the less wavy the graph
bw = 1,
linewidth = 0.1
) +
# Add annotations on the right side using rich text labels from ggtext
geom_richtext(
data = annotation_numbers,
aes(
x = 2024 + 0.2,
y = y,
label = label
),
hjust = 0,
lineheight = 0.9,
label.size = NA,
size = 5,
family = font_family
) +
# Configure the x-axis with major and minor breaks
scale_x_continuous(
breaks = seq(2008, 2024, 2),
minor_breaks = 2007:2024
) +
# Manually assign fill colors for the dependency types
scale_fill_manual(values = colors_viz_4) +
# Allow annotations to extend outside the plot area
coord_cartesian(clip = "off") +
# Set plot titles and descriptions using inline color styling
labs(
title = str_glue("Number of packages on CRAN in 2024 <span style='color:{color_1}'>depending on</span>, <span style='color:{color_2}'>importing</span>, or <span style='color:{color_3}'>suggesting</span> {{ggplot2}}"),
subtitle = "Aggregated by the initial package release years. Categories may change from one version to another and were taken from the latest versions.",
caption = "Data: CRAN via {pkgsearch} | Visualization: Antti Rask | Updated: 2024-12-31"
) +
# Apply a clean, minimal theme and customize text styling
theme_minimal(base_family = font_family) +
# Customize title and caption styling
theme(
axis.text.x = element_text(
size = 14,
face = "bold",
margin = margin(10, 0, 0, 0)
),
axis.text.y = element_blank(),
axis.title = element_blank(),
legend.position = "none",
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
plot.margin = margin(10, 50, 10, 10),
plot.title = element_markdown(
face = "bold",
size = 16,
hjust = 0.5
),
plot.subtitle = element_text(
hjust = 0.5,
margin = margin(0, 0, 20, 0)
),
plot.caption = element_text(
size = 10,
color = darken("darkgrey", 0.4),
hjust = 0.5,
margin = margin(20, 0, 0, 0)
)
)
A streamgraph, like other area charts, isn’t the best choice for making detailed comparisons between groups. But it’s excellent for giving a broader picture of what’s going on in the data.
3.2 Bar charts
Bar charts are the backbone of data visualization. And ggplot2 handles most types of bar charts out of the box. However, there are situations where you may want to take it a step further.
In this section, we’ll take a look at the Likert chart and the mosaic chart, also known as the Marimekko chart or the Mekko chart, after the company (go Finland!).
3.2.1 Likert chart
A Likert chart is a diverging stacked bar chart. It’s used to visualize responses to a questionnaire using the Likert (or similar) scale format. The “respondents are asked to indicate their degree of agreement or disagreement on a symmetric agree-disagree scale for each of a series of statements” (Burns and Bush 2007).
For Likert charts, ggstats (Larmarange 2025a) is my package of choice. It also has other functionalities, which we’ll return to in Section 4.14.1.
3.2.1.1 Viz #5: PISA 2022 Questionnaire
I was looking for a good, open data source for the Likert chart. Most packages (ggstats included) use older PISA (Programme for International Student Assessment) questionnaires for the example data. I was able to find more recent data and selected seven statements from the PISA 2022 Student Questionnaire (OECD 2023). There was a lot of data to choose from, but I selected the statements that a) dealt with creativity and b) had the fewest NAs in the answers.
I chose three countries to compare: Canada, Finland, and Great Britain. I would’ve liked to include the US in the comparison as well, but sadly the data was all NAs for these statements.
Let’s first take a look at the data we have.
There is a country column, but the other columns each have answers to one of the statements. They all have a technical ID as a name (e.g., ST339Q04JA) and these possible values:
- Strongly disagree
- Disagree
- Agree
- Strongly agree
- NA
We can already take a look at what this data would look like visualized with one function, gglikert(). We don’t want to include the country column yet.
By default, gglikert() ignores the NA values. If we wanted to, we could turn them into a character string and then use the exclude_fill_values argument to “[count] them in the denominator for computing proportions” (Larmarange 2025a).
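The effect of that choice on the proportions can be illustrated with a toy vector in base R. The answers and the "No answer" label below are made up. Only the idea of keeping the missing answers in the denominator comes from the ggstats documentation quoted above.

```r
# Toy answers with two missing values (illustrative data)
answers <- c("Agree", NA, "Disagree", "Strongly agree", NA)

# Turn the NAs into an explicit category ("No answer" is a hypothetical label)
answers_explicit <- ifelse(is.na(answers), "No answer", answers)

# Share of "Agree" when the NAs are ignored (the default behavior)
share_excluding_na <- mean(answers[!is.na(answers)] == "Agree")

# Share of "Agree" when the NAs stay in the denominator
share_including_na <- mean(answers_explicit == "Agree")

c(excluding_na = share_excluding_na, including_na = share_including_na)
```

With the NAs converted like this, passing the same label to exclude_fill_values would keep those answers in the denominator without drawing a bar segment for them.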
Now, two things are wrong with this first plot.
- We don’t know what the statements were (unless we go and look elsewhere)
- The categories aren’t in the correct order. The default order for the character type is alphabetical
First, let’s look at the labels we have in a separate tibble.
Let’s turn them into a named vector that we can use. We’ll use the deframe() function from tibble (Müller and Wickham 2025) for that.
ST339Q04JA
"Creativity can only be expressed through the arts."
ST339Q06JA
"It is possible to be creative in nearly any subject."
ST341Q01JA
"I enjoy creating art."
ST341Q02JA
"I enjoy artistic activities."
ST341Q03JA
"I express myself through art."
ST341Q04JA
"I reflect on movies I watch."
ST341Q05JA
"I see beauty in everyday things."
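If you haven’t used deframe() before: for a two-column table, it takes the first column as the names and the second as the values. The base R sketch below shows the equivalent step with setNames(). The data frame `labels_df` and its column names are hypothetical, and only two of the seven labels are included.

```r
# A two-column lookup table: technical ID and statement text
labels_df <- data.frame(
  variable = c("ST341Q01JA", "ST341Q02JA"),
  label = c("I enjoy creating art.", "I enjoy artistic activities.")
)

# Base R equivalent of deframe(): first column as names, second as values
labels_vector <- setNames(labels_df$label, labels_df$variable)

labels_vector

# Look up a statement by its technical ID
labels_vector[["ST341Q02JA"]]
```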
Then we’ll create another vector for the agreement levels. Notice that they are now in the order we want them to be in.
agreement_levels <- c(
  "Strongly disagree",
  "Disagree",
  "Agree",
  "Strongly agree"
)
We’ll convert the statement columns in our data set into factors. And we’ll use the fct_relevel() function to reorder those factors in the desired order.
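A minimal base R sketch of this step, using a toy data frame (`toy_statements` is hypothetical and its answers are made up). It uses factor() with an explicit levels argument, which leads to the same level order as fct_relevel() as long as all four answers occur in the data.

```r
agreement_levels <- c(
  "Strongly disagree",
  "Disagree",
  "Agree",
  "Strongly agree"
)

# Toy data: a country column and one statement column (made-up answers)
toy_statements <- data.frame(
  country = c("Canada", "Finland", "Great Britain", "Finland"),
  ST341Q01JA = c("Agree", "Strongly disagree", "Disagree", "Strongly agree")
)

# By default, factor levels are sorted alphabetically
levels(factor(toy_statements$ST341Q01JA))

# Convert every column except `country` to a factor with the desired order
statement_cols <- setdiff(names(toy_statements), "country")
toy_statements[statement_cols] <- lapply(
  toy_statements[statement_cols],
  factor,
  levels = agreement_levels
)

levels(toy_statements$ST341Q01JA)
```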
Let’s see what the plot looks like now after those two changes.
pisa_2022_statements_refactored %>%
  gglikert(
    include = -country,
    variable_labels = pisa_2022_labels_vector
  )
Much better! Let’s try adding the country as a grouping variable. We have two basic options for arguments, facet_cols and facet_rows. Let’s first see what it would look like if we facet the plot with the countries as columns.
pisa_2022_statements_refactored %>%
  gglikert(
    include = -country,
    facet_cols = vars(country),
    variable_labels = pisa_2022_labels_vector
  )
That makes it quite hard to compare the values between the countries. Let’s see what it would look like if we facet the plot with the countries as rows.
pisa_2022_statements_refactored %>%
  gglikert(
    include = -country,
    facet_rows = vars(country),
    variable_labels = pisa_2022_labels_vector
  )
That’s way too busy, and still, it’s hard to compare the countries. Luckily, we have a third option.
We can facet the rows using the statements as the upper level. And then use the countries as a lower level. This way, the plot is both readable and easier to compare between the countries.
The rest of the code is primarily used to make the visualization more presentable.
library(ggplot2) # for general plotting functions
library(ggstats) # for Likert charts
# Create a Likert chart for PISA 2022 student statements, faceted by statement
gglikert(
# Data
pisa_2022_statements_refactored,
# Include all columns except 'country'
include = -country,
# y-axis shows countries
y = "country",
# Use custom labels for statements
variable_labels = pisa_2022_labels_vector,
# Sort responses in descending order
sort = "descending",
# Facet rows by statement variable
facet_rows = vars(.question)
) +
# Adjust facet grid layout: wrap statement labels and move to the left side
facet_grid(
# Facet rows by statement variable
rows = vars(.question),
# Switch facet labels to the left side
switch = "y",
# Wrap long labels at 30 characters
labeller = label_wrap_gen(30)
) +
# Manually set Likert scale colors for each response level
scale_fill_manual(
values = c(
"Strongly disagree" = "#ca0020",
"Disagree" = "#f4a582",
"Agree" = "#92c5de",
"Strongly agree" = "#0571b0"
)
) +
# Add title, subtitle, and caption to the plot
labs(
title = "The majority of respondents in Canada, Finland, and Great Britain feel creative, even if they don't all express themselves through art",
subtitle = "PISA 2022 Student Questionnaire - a sample of 7 statements",
caption = "Data: OECD | Visualization: Antti Rask"
) +
# Apply custom theme styling for text, facets, and layout
theme(
text = element_text(family = "wqy-microhei"),
axis.text = element_text(color = "#333333"),
strip.background.y = element_rect(fill = "#FFD166"),
strip.placement.y = "outside", # Place facet labels outside the axis, in this case on the left
strip.text.y.left = element_text(angle = 0, hjust = 0.5, color = "#333333"),
legend.margin = margin(0, 0, 0, 0),
plot.caption = element_text(hjust = 0, color = "#333333"),
plot.caption.position = "plot",
plot.margin = margin(10, 10, 10, 10),
plot.title.position = "plot"
)
The kids seem to be alright! At least when it comes to feeling creative. There are some interesting differences between the three countries, but nothing too extreme.
3.2.2 Mosaic chart
A mosaic chart is a special type of stacked bar chart. It differs from a normal one in that “the heights and widths of individual shaded areas vary” (Wilke 2019).
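To make that quote concrete, here is a small sketch of the arithmetic behind a mosaic chart. The counts are made up for illustration, and the object names (toy_counts, widths, heights, areas) are my own, not something ggmosaic provides. Each column's width is the share of one category, each tile's height is a share within that column, and the tile's area ends up equal to the share of that combination in the whole data set.

```r
# Hypothetical counts: rows are ratings, columns are genres
toy_counts <- matrix(
  c(30, 10,
    20, 40),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(rating = c("PG", "R"), genre = c("Comedy", "Drama"))
)

# Width of each column: share of each rating among all movies
widths <- rowSums(toy_counts) / sum(toy_counts)

# Height of each tile: share of each genre within a rating
heights <- prop.table(toy_counts, margin = 1)

# Area of each tile: width times height, which equals the joint share
areas <- widths * heights

widths
heights
areas
```

Because both the widths and the heights depend on the data, two tiles with the same count can look quite different, which is worth keeping in mind when reading the chart.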
To create a mosaic chart, we’ll be using ggmosaic (Jeppson, Hofmann, and Cook 2021). You can use it to create bar charts, stacked bar charts, mosaic charts, and double-decker charts. But we’ll be concentrating on the mosaic chart.
To use ggmosaic, you’ll also have to install labelled (Larmarange 2025b). If you don’t, ggmosaic will remind you of this fact when you try to run your code for the first time.
3.2.2.1 Viz #6: IMDB movies, Part II
We return to the IMDB movies data set. This time, we’ll look at how many movies in each genre have a certain MPAA rating (among those movies that have a rating in the first place).
We’ll use the already familiar movies_na data set. But we’ll have to make some adjustments to get it ready for ggmosaic.
The new function here is fct_infreq(). It helps us order the factor levels by frequency, even though we don’t have a separate column for those counts.
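Here is a minimal sketch of what fct_infreq() does, using a made-up vector of genres (the vector itself is an assumption for illustration, not part of the movies data).

```r
library(forcats)

# Hypothetical genre vector: Drama appears 3 times, Comedy 2, Action 1
genres <- factor(c("Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"))

# Default factor levels are alphabetical
levels(genres)

# fct_infreq() reorders the levels by number of observations, largest first
levels(fct_infreq(genres))
```

The values in the vector stay the same. Only the order of the levels changes, and that order is what the plot later uses for stacking and coloring.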
# Transform and filter the `movies_na` data set
movies_ggmosaic <- movies_na %>%
# Pivot genre columns (from Action to Short) into long format:
# Creates two columns: "genre" and "value"
pivot_longer(Action:Short, names_to = "genre") %>%
# Filter to keep only:
# - rows where the movie is flagged in that genre
# - rows where the mpaa value is not NA
filter(
value != 0,
!is.na(mpaa)
) %>%
# Select only the mpaa and genre columns
select(mpaa, genre) %>%
# Convert both columns to a factor
mutate(
# For mpaa, the levels and their order are clear
mpaa = mpaa %>% factor(levels = c("PG", "PG-13", "R", "NC-17")),
# For genre, we'll use fct_infreq to order the levels by number of
# observations with each level (largest first)
genre = genre %>% as_factor() %>% fct_infreq(.)
)
That’s all we need to take a look at the basic functionality of geom_mosaic().
We’ll wrap mpaa in a helper function, product(), to prepare it for the x-axis, and then assign genre to the fill aesthetic. The offset argument sets the space between the tiles, and show.legend does exactly what the name promises.
We can use this basic version as a basis for the final version. So let’s also assign the plot to plot_ggmosaic.
library(ggmosaic) # for creating mosaic charts (for categorical data)
library(ggplot2) # for general plotting
# Build a mosaic plot of movie ratings (mpaa) vs. genres
plot_ggmosaic <- ggplot(movies_ggmosaic) +
geom_mosaic(
aes(
# x-axis shows mpaa rating categories
x = product(mpaa),
# Fill color represents movie genre
fill = genre
),
# Add small spacing between tiles
offset = 0.02,
# Hide legend (optional)
show.legend = FALSE
)
# Render the plot
plot_ggmosaic
You get the idea. Next, let’s see what we need to do to get this into a more presentable state.
- Add text labels to show the counts of the different tiles (except the small ones)
- Change the color palette with ggthemes
- Use a custom theme, theme_mosaic(), that comes with ggmosaic
First, let’s take a look at those text labels. ggmosaic comes with a geom_mosaic_text() function, but we’ll use geom_text() from ggplot2 instead. Some of the tiles are so small that it doesn’t make sense to label them. This way, we have more control over the labels and can decide not to show them when the counts are too small (under 70 in this case).
We’re using the data from the plot_ggmosaic object. It’s a list, after all, containing all sorts of data we can use. We’ll calculate the label positions, xpos and ypos. Then we’ll count the combinations of mpaa and genre and combine the two into one tibble.
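If pulling data back out of a plot feels abstract, here is a minimal sketch on a toy bar chart instead of the mosaic plot. The data frame toy and the column names xpos and ypos are assumptions for illustration. layer_data() is a ggplot2 helper that returns the built data for a single layer, which should match what we get by plucking that layer from the ggplot_build() result.

```r
library(ggplot2)

# Hypothetical toy data: three observations in two groups
toy <- data.frame(grp = c("a", "a", "b"))

p <- ggplot(toy, aes(grp)) +
  geom_bar()

# Computed data for the first (and only) layer, one row per bar
bars <- layer_data(p, 1)

# Center of each bar: midpoint of its bounding box
bars$xpos <- (bars$xmin + bars$xmax) / 2
bars$ypos <- (bars$ymin + bars$ymax) / 2

bars[, c("xmin", "xmax", "ymin", "ymax", "xpos", "ypos")]
```

The mosaic tiles work the same way: each one has an xmin, xmax, ymin, and ymax, so the midpoint gives us a sensible place to put a label.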
library(dplyr) # for data manipulation
library(purrr) # for pluck()
# Extract built plot data (data for all geoms)
p_built <- ggplot_build(plot_ggmosaic)
# Get tile data for the first layer and compute tile centers
tile_data <- p_built %>%
# Extract data for first layer of plot
pluck("data", 1) %>%
# Compute x and y centers of each tile
mutate(
xpos = (xmin + xmax) / 2,
ypos = (ymin + ymax) / 2
)
# Count original observations for each (mpaa, genre) pair
counts <- movies_ggmosaic %>%
count(mpaa, genre)
# Join counts to tile data, add label only if count >= 70
tile_data_with_counts <- tile_data %>%
left_join(
counts,
by = c(
"x__mpaa" = "mpaa",
"x__fill__genre" = "genre"
)
) %>%
# Drop labels for small counts
mutate(n_label = if_else(n < 70, NA, n)) %>%
# Keep only label and tile center coordinates
select(n_label, xpos, ypos)
With our text labels ready, we can create the final visualization. We start with plot_ggmosaic and add other layers on top.
We have as many as seven categories (genres). We’ll use scale_fill_colorblind() from ggthemes to create a nice colorblind-friendly color palette.
The custom theme, theme_mosaic(), doesn’t need any arguments.
library(ggthemes) # for the color scale
# Add labels, colorblind palette, titles, and custom theme to the mosaic plot
plot_ggmosaic +
# Add text labels to tiles, using the prepared center positions
# and filtered counts
geom_text(
data = tile_data_with_counts,
aes(
x = xpos, # x-position: tile center x
y = ypos, # y-position: tile center y
label = n_label # Label: count (only if >= 70)
),
colour = "white", # White text for contrast
fontface = "bold",
size = 5
) +
# Use a colorblind-friendly palette for fills
scale_fill_colorblind() +
# Add title, subtitle, and caption
labs(
title = '"Yes More Drama in My Life"',
subtitle = "A count of different combinations of genre and MPAA ratings",
caption = "Data: IMDb movies (1893-2005) via {ggplot2movies} | Visualization: Antti Rask",
x = "",
y = ""
) +
# Apply a built-in mosaic theme as a base
theme_mosaic() +
# Additional theme tweaks for text styling
theme(
axis.text = element_text(
size = 14,
face = "bold"
),
plot.title = element_text(
face = "bold",
size = 16,
hjust = 0.5,
margin = margin(20, 0, 10, 0)
),
plot.subtitle = element_text(
size = 14,
hjust = 0.5,
margin = margin(0, 0, 0, 0)
),
plot.caption = element_text(
size = 10,
color = "darkgrey",
hjust = 0.5,
margin = margin(10, 0, 10, 0)
)
)
A mosaic chart isn’t meant for comparing small details, but it does give you a good overview of the proportions between the different category combinations.
3.3 Density charts
Density charts work similarly to histograms. They make it easy to visualize a distribution (or distributions) of data. Again, ggplot2 has a geom called geom_density() that you can use for a basic density chart.
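As a point of reference, a basic density chart with plain ggplot2 could look like the sketch below. It uses faithful, a data set that ships with R, rather than the movies data, and the fill color is an arbitrary choice.

```r
library(ggplot2)

# `faithful` comes with base R: eruption durations of the Old Faithful geyser
plot_density <- ggplot(faithful, aes(x = eruptions)) +
  # Estimate and draw a smoothed density curve for the eruption durations
  geom_density(fill = "grey80")

# Render the plot
plot_density
```

The two bumps in the curve are the kind of detail that a poorly chosen histogram bin width can hide.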
In this section, we’ll take a look at the raincloud chart and the ridgeline chart. The latter was formerly known as the joyplot, after Joy Division’s 1979 album Unknown Pleasures (Wilke 2017).
3.3.1 Raincloud chart
A raincloud chart isn’t a geom. It combines three different geoms: boxplot, violin (or at least half of one), and point. Raincloud charts were “presented in 2019 as an approach to overcome issues of hiding the true data distribution when plotting bars with errorbars — also known as dynamite plots or barbarplots — or box plots” (Scherer 2021).
You could create a raincloud chart using those individual geoms and mostly ggplot2. Cédric Scherer demonstrates this by creating a raincloud chart with the Palmer Penguins data.
We have, however, a package, ggrain (Judd, van Langen, and Kievit 2024), that we can use for the same purpose. That there are numerous ways to accomplish the same task is both the beauty and the primary source of frustration with these tools. In any case, you get to choose which one works the best for you. You can also mix and match. You’ll see this when we reach the final visualization.
3.3.1.1 Viz #7: IMDB movies, Part III
We return, once again, to the IMDB movies data set. This time, we’ll look at how the ratings for movies in three genres (Action, Documentary, and Short) are distributed. We’ll make some adjustments for ggrain. As you can see, data wrangling plays a significant role in the data visualization process. Context is key.
# Load required packages
library(dplyr) # For data manipulation
library(ggplot2movies) # Contains the 'movies' dataset
library(tidyr) # For pivoting functions like pivot_longer()
# Prepare movie data for plotting or analysis
movies_ggrain <- movies %>%
# Reshape genre columns (Action to Short) into long format:
# each row becomes one movie-genre pair with value 0 or 1
pivot_longer(Action:Short, names_to = "genre") %>%
# Filter rows:
filter(
# Keep only selected genres
genre %in% c("Action", "Documentary", "Short"),
# Remove unusually long films
length < 500,
# Keep only genres that apply (value == 1)
value != 0,
# Focus on movies released after 2000
year > 2000
) %>%
# Keep only the genre and rating columns for further use
select(genre, rating)
We’re left with a lot of genre-rating pairings. Here’s what they look like using only the default settings of geom_rain().
# For raincloud-style plots (combining density, boxplot, and raw data)
library(ggrain)
# Create a raincloud plot showing rating distributions by genre
movies_ggrain %>%
ggplot(aes(genre, rating, fill = genre)) +
geom_rain()
We have the points on the left, the boxplot in the middle, and the half-violin on the right. This already tells us something. However, there are too many points to make much sense of them as they are. We’ll get back to that in the final visualization.
Before we proceed, I would like us to examine the version where the different groups overlap. That requires us to use an additional package, ggpp (Aphalo 2025). We borrow the function position_dodgenudge() from it. We’ll explore other uses of the package in Section 5.8.
The cov argument is used to assign a covariate to color the dots by. boxplot.args.pos lets us add a list of positional arguments for the boxplot. There are similar arguments for line (which we won’t use), point, and violin. Otherwise, it’s pretty much the same code as in the first example.
# For advanced position adjustments (e.g., position_dodgenudge)
library(ggpp)
# Create a raincloud plot with dodged and nudged boxplots for each genre
movies_ggrain %>%
ggplot(aes(x = 1, y = rating, fill = genre)) +
geom_rain(
# Group by 'genre' for separate rainclouds at the same x-position
cov = "genre",
# Pass custom positioning to the internal boxplot layer
boxplot.args.pos = list(
# Slightly separate boxplots by genre
position = position_dodgenudge(
# Nudge boxplots sideways
x = 0.1,
# Control dodge width (how far apart they spread)
width = 0.1
),
# Width of individual boxplots
width = 0.1
)
)
As you can see, some transparency (an alpha value) is much needed here, too. It’s impossible to see all the overlapping elements without it.
Besides that, I can see uses for both the separate and overlapping versions of the chart. Let’s use the first one for the cleaned-up version.
There is one new function from the forcats package, fct_reorder(). This is one of my most used forcats functions. It allows us to reorder the categories (genre) by the values in another column (rating). It’s super helpful with bar charts, and also the best tool to use here. If the categories can be compared, it makes more sense to order them like this than in the default alphabetical order.
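A minimal sketch of fct_reorder() on made-up ratings (the toy data frame is an assumption for illustration). By default, the function summarizes the second column with the median.

```r
library(forcats)

# Hypothetical ratings: medians are Action = 6, Documentary = 8, Short = 7
toy <- data.frame(
  genre  = factor(rep(c("Action", "Documentary", "Short"), each = 3)),
  rating = c(5, 6, 7, 7, 8, 9, 6, 7, 8)
)

# Default order of the levels is alphabetical
levels(toy$genre)

# Reorder by median rating, lowest first
levels(fct_reorder(toy$genre, toy$rating))

# Reorder by median rating, highest first
levels(fct_reorder(toy$genre, toy$rating, .desc = TRUE))
```

If the median isn't the summary you want, the .fun argument accepts another function, such as mean.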
We’ll use two stat_summary() layers to foreshadow Chapter 4: one for computing and annotating the median, the other for computing and annotating the sample size. These two are copied from Cédric’s code and are here to remind us that you can often combine code from two sources. There are also some other touches, such as the darken()/lighten() effects from colorspace. They are from that same source.
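To see the stat_summary() idea in isolation, here is a minimal sketch with made-up values. The toy data frame and the vjust value are assumptions for illustration. The layer computes one median per group and prints it as a text label through after_stat(y).

```r
library(ggplot2)

# Hypothetical values: the median is 2 for group a and 5 for group b
toy <- data.frame(
  grp   = c("a", "a", "a", "b", "b"),
  value = c(1, 2, 3, 4, 6)
)

plot_median <- ggplot(toy, aes(grp, value)) +
  # Raw observations
  geom_point() +
  # One text label per group, placed at the group's median
  stat_summary(
    geom = "text",
    fun = "median",
    aes(label = after_stat(y)),
    vjust = -1
  )

# Render the plot
plot_median
```

The same pattern works with fun.data when one function needs to return both a position and a label, which is how the sample size annotation is done.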
Since we are showing all the points in the plot, there’s no need to show the outliers in the boxplot. That is done by assigning NA to the outlier.shape argument.
# Define custom fill colors for each genre
colors_vis_7 <- c(
"Action" = "#f487b6",
"Documentary" = "#3772FF",
"Short" = "#43B929"
)
# Define axis label color order (to match reordered factor levels)
colors_vis_7_axis <- c("#3772FF", "#43B929", "#f487b6")
# Define custom summary function to display sample sizes for each genre
add_sample <- function(x) {
return(
c(
# Y-position just above the max
y = max(x) + .025,
# Sample size as label
label = length(x)
)
)
}
# Load required packages
library(colorspace) # For color manipulation (darken/lighten)
library(forcats) # For factor reordering with fct_reorder()
library(ggplot2) # For base plotting
library(ggrain) # For raincloud-style plots
# Create the raincloud plot
ggplot(
movies_ggrain,
aes(
# Reorder genres by median rating
fct_reorder(genre, rating, .desc = TRUE),
rating,
fill = genre,
color = genre
)
) +
# Add raincloud plot (half density + boxplot + jittered points)
geom_rain(
# Draw rain (density) on the right side
rain.side = 'r',
# Customize boxplot style
boxplot.args = list(
color = "black",
# Hide outlier dots
outlier.shape = NA
),
# Set point transparency
point.args = list(alpha = 0.1),
# Jitter points horizontally
point.args.pos = list(position = position_jitter(seed = 1, width = .06))
) +
# Add median values as text labels above each raincloud
stat_summary(
geom = "text",
fun = "median",
aes(
# Format median to 2 decimals
label = round(after_stat(y), 2),
# Darken text color slightly
color = stage(genre, after_scale = darken(color, .2, space = "HLS"))
),
size = 3.5,
vjust = -5
) +
# Add sample size (n =) labels using custom summary function
stat_summary(
geom = "text",
fun.data = add_sample,
aes(
label = paste("n =", after_stat(label)),
color = stage(genre, after_scale = darken(color, .2, space = "HLS"))
),
size = 3.5,
hjust = -0.25
) +
# Manually set outline (color) values — darker than fill
scale_color_manual(
values = colors_vis_7,
# Hide color legend
guide = "none"
) +
# Manually set fill colors (lightened versions of outline)
scale_fill_manual(
values = lighten(colors_vis_7, .4, space = "HLS"),
# Hide fill legend
guide = "none"
) +
# Set y-axis ticks at whole rating points (1 to 10)
scale_y_continuous(breaks = seq(1, 10, by = 1)) +
# Flip axes for horizontal layout. Allow text to extend outside plot area
coord_flip(xlim = c(1.4, NA), clip = "off") +
# Add plot labels
labs(
x = NULL,
y = NULL,
title = "Documentaries have, on average, the best ratings",
subtitle = "Distribution of IMDB ratings (from 2001 to 2005) by genre",
caption = "Data: IMDB movies (1893-2005) | Visualization: Antti Rask"
) +
# Apply a clean minimal theme
theme_minimal() +
# Customize text, grid, and spacing
theme(
axis.text.y = element_text(
color = darken(colors_vis_7_axis, .1, space = "HLS"),
face = "bold",
size = 12
),
panel.grid.major.y = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(
face = "bold",
size = 14
),
plot.title.position = "plot",
plot.subtitle = element_text(
color = "grey40",
margin = margin(0, 0, 10, 0),
size = 12
),
plot.caption = element_text(
color = "grey40",
margin = margin(15, 0, 0, 0)
),
plot.margin = margin(15, 45, 10, 15),
text = element_text(family = "wqy-microhei")
)
This way, we have the best of all three worlds. We can see that the Documentary genre has the highest median. It also has the strongest concentration of points around the median (which also shows in the half-violin), especially compared to Action.