2  Data

2.1 Data sets

This book is about tools for creating visualizations. But to visualize data you first need data. So let’s start by taking a look at some of the data sets available to us without much hassle.

2.1.1 base R

Base R comes with a bunch of data sets ready to use. There are classics like iris and mtcars, but the datasets package has many more to choose from.
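One way to browse what is available is to ask data() for the contents of the datasets package. The following is a minimal sketch using only base R; the variable name available is just an illustration.

```r
# Ask base R which data sets ship with the datasets package
available <- data(package = "datasets")$results

# The result is a character matrix; Item and Title are the useful columns
head(available[, c("Item", "Title")])

# How many entries are listed in total?
nrow(available)
```

Calling data() without any arguments lists the data sets of every package that is currently attached.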

Since the datasets package comes from base R, the data is not always immediately ready to use with ggplot2 (Wickham 2024). Luckily we have the tidyverse (Wickham, Averick, et al. 2023) packages that make it easy to make the necessary changes.

Here is an example using the USArrests (‘Violent Crime Rates by US State’) data set. We can start by loading it with the data() function:

data(USArrests)

Let’s take a quick look at the first few rows of the data set:

head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

As you can see, USArrests is a data frame. It has four columns, and the names of the US states are stored as row names. We would like to have them in a column of their own instead. We can use the tibble (Müller and Wickham 2023) package to do that. While we’re at it, let’s also change the data frame into a tibble:

library(tibble)

usarrests_tbl <- USArrests %>%
  rownames_to_column(var = "State") %>%
  as_tibble()
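As a side note, the same result can probably be reached in a single step, because as_tibble() accepts a rownames argument. This is a minimal sketch; the name usarrests_alt is only used here for illustration.

```r
library(tibble)

# Convert to a tibble and move the row names into a "State" column in one call
usarrests_alt <- as_tibble(USArrests, rownames = "State")

# The state names now sit in the first column
names(usarrests_alt)
```

Both approaches should give the same tibble. The rest of this section continues with usarrests_tbl.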

We can then use the new tibble to create a simple visualization with ggplot2:

library(dplyr)   # needed for slice_max()
library(ggplot2)

usarrests_tbl %>%
  # Keep the ten states with the highest murder rates
  slice_max(Murder, n = 10) %>%
  ggplot(aes(Murder, UrbanPop)) +
  geom_point() +
  geom_label(aes(label = State))

2.1.2 IMDb movies (1893-2005)

ggplot2movies (Wickham 2015) used to be a part of the ggplot2 package itself. It’s now its own package to make ggplot2 lighter.

But it’s a cool little package. It has Internet Movie Database (IMDb) data about movies from between 1893 and 2005. The selected movies have “a known length and had been rated by at least one [IMDb] user.” (Wickham 2015).

The Movies data set has qualities that make it good for our needs. Let’s start by loading it:

library(ggplot2movies)

data(movies)

Let’s take a quick look at what some of the data looks like:

head(movies)

Movies is already a tibble. It consists of 58788 rows (observations) and 24 columns (variables).

When starting to work with a new data set, it’s always good to take a look at the documentation to understand what is in those rows and columns (and what is not).
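A minimal sketch of how to get to the documentation from the R console, followed by a quick check of the size of the data against what the help page says:

```r
library(ggplot2movies)

data(movies)

# Open the help page of the data set (the same as typing ?movies)
help("movies", package = "ggplot2movies")

# Compare the size and the column names with the documentation
dim(movies)
names(movies)
```

If the numbers in the help page and the output of dim() disagree, trust the data and treat the documentation with some caution.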

Movies is a good example data set because it includes:

  • A goldilocks amount of data. Not too little, not too much
  • Categorical data of both nominal (title, the genre indicator columns) and ordinal (mpaa) kind
  • Numerical data of both continuous (budget, length, rating) and discrete (year, votes) kind

We can use two of those columns, year and rating, to create a simple visualization with ggplot2:

library(ggplot2)

movies %>%
  ggplot(aes(year, rating)) +
  geom_point()

As mentioned earlier, Movies is already a tibble. That doesn’t mean the data is in an optimal format for every kind of visualization, though. We’ll do the necessary data wrangling within the chapter where we use the data.

2.1.3 RDatasets

RDatasets is not an R package, but it is an excellent GitHub repo: a “collection of datasets originally distributed in various R packages” (Arel-Bundock 2024).

On 2024-11-11, the collection listed 2337 data sets.

The RDatasets repo contains the full list, along with a .csv file and documentation for each data set.

If I had to choose one fun data set from the list to highlight, it would be starwars from the dplyr (Wickham, François, et al. 2023) package.

You can choose to use the .csv file provided on the website. Another way to use the collection is to choose the dataset from the list and load the package it comes with:

library(dplyr)

data(starwars)
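For the .csv route, a small helper can build the address of a file from the package and data set names. This is only a sketch: rdatasets_csv_url() is a hypothetical helper, and the URL pattern is an assumption about how the RDatasets website is laid out, so check it against the site before relying on it.

```r
# Hypothetical helper: build the URL of a data set's .csv file.
# The URL pattern is an assumption about the site layout and may change.
rdatasets_csv_url <- function(package, item) {
  sprintf(
    "https://vincentarelbundock.github.io/Rdatasets/csv/%s/%s.csv",
    package, item
  )
}

starwars_url <- rdatasets_csv_url("dplyr", "starwars")
starwars_url

# Reading the file needs an internet connection, so it is commented out here
# starwars_csv <- read.csv(starwars_url)
```

Loading the package version, as done above, avoids the download and keeps the column types intact.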

Let’s take a quick look at what some of the starwars data looks like:

head(starwars)

There are a bunch of Star Wars characters and their stats.

Let’s choose two columns, height and species (and filter for six of the more well-known species). We’ll use them to create a simple visualization with ggplot2:

library(ggplot2)

starwars %>%
  filter(species %in% c("Droid", "Ewok", "Gungan", "Human", "Hutt", "Wookiee")) %>%
  ggplot(aes(height, species)) +
  geom_boxplot()

This concludes the section about the different data sets available to every R user. Next, we’ll take a look at some of the ggplot2 extensions that make it easier to do exploratory data analysis (EDA).

2.2 Exploratory data analysis (EDA)

Exploratory data analysis (or EDA, which we’ll be using from now on) is a process, even if it is a loose one. The Cambridge Dictionary ((Ed.) 2013) defines process as “a series of actions that you take in order to achieve a result”. So we have a) a series of actions and b) a result we wish to achieve.

Let’s look at the results first. After all, that is why we do things. You might have other, more specific goals depending on your particular field or use case. But in the most general sense, the result we’re after is a better understanding of the data (set) we’re working on.

What is the series of actions we need to take? As I mentioned earlier, EDA is a loose process. There are as many ways to go about it as there are analysts and data sets. Still, some common steps usually occur: looking at missing values, summarizing data, and visualizing data.
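To make the three steps concrete, here is a minimal base R sketch. It uses the built-in airquality data set as a stand-in, which is my choice for illustration and not one of the data sets used elsewhere in this chapter.

```r
# 1. Missing values: count the NAs in each column
na_counts <- colSums(is.na(airquality))
na_counts

# 2. Summarizing: basic descriptive statistics for each column
summary(airquality)

# 3. Visualizing: a quick, unpolished histogram of one variable
hist(airquality$Ozone, main = "Ozone", xlab = "Ozone (ppb)")
```

The following sections go through the same steps with packages that are built for the job.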

It’s good to note these visualizations aren’t usually meant for publication. Compared to those you find later on in the book, these are more for your eyes than for the eventual audience.

The last thing we’ll look at in this chapter is one way to automate the EDA process using an app. I must warn you, though: it’s better to use these tools only after you’ve gained experience from doing EDA without them. It might sound counterintuitive, but trust me. It can be overwhelming if you don’t know what you’re doing.

2.2.1 Missing values

Let’s first load the data. Looking at the movies data set earlier, we noticed the mpaa column had many blank values. We don’t know whether those movies never had a rating in the first place or whether the rating is missing. For the sake of this demonstration, let’s assume they all should have a rating.

We’ll begin by turning all the blank values (of the character type) into NA (not available).

library(dplyr)
library(ggplot2movies)

movies_na <- movies %>%
  mutate(
    # Turn all blank values of the character type into NA
    across(where(is.character), ~na_if(., "")), 
    # Create a decade column based on the values in the year column
    decade = floor(year / 10) * 10
  )
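To see what those two steps do, it can help to run them on a tiny made-up tibble first. The data below is invented purely for illustration.

```r
library(dplyr)

# A made-up tibble with one blank MPAA rating
toy <- tibble(
  title = c("Movie A", "Movie B", "Movie C"),
  mpaa  = c("R", "", "PG"),
  year  = c(1994, 2001, 1987)
)

toy_na <- toy %>%
  mutate(
    # Blank strings in character columns become NA
    across(where(is.character), ~ na_if(.x, "")),
    # 1994 becomes 1990, 2001 becomes 2000, 1987 becomes 1980
    decade = floor(year / 10) * 10
  )

toy_na
```

Only the blank string is replaced. The other values in the character columns stay as they are.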

The naniar package (Tierney et al. 2024) has many functions for visualizing missing (NA) values. It also contains many functions beyond visualization, but that’s for another book.

One simple function is gg_miss_var(). It creates a lollipop plot that shows which columns (variables) contain the most missing (NA) values.
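A minimal sketch of the call follows. It rebuilds the NA-coded data under a separate name, movies_check (chosen only for this example), so that the block runs on its own.

```r
library(dplyr)
library(ggplot2movies)
library(naniar)

# Rebuild the NA-coded data so that this block is self-contained
movies_check <- movies %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

# One lollipop per column, ordered by the number of missing values
p_miss_var <- gg_miss_var(movies_check)
p_miss_var
```

Because the result is a ggplot object, it can be adjusted further with the usual ggplot2 functions.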

We can see that MPAA ratings aren’t the only thing missing. Almost the same number of films are missing the budget information. With budget, it’s easier to say that if we don’t have a number, it is missing. Then again, that column did have NAs in place from the beginning.

We’re also interested in seeing if there is overlapping missingness between the columns. We’ll use an upset plot for that, created with gg_miss_upset().

library(naniar)

movies_na %>%
  gg_miss_upset(
    # Number of sets to look at. We know there are only two columns with NA
    nsets = 2
  )

More than 3000 movies are missing just the MPAA rating, and almost as many are missing just the budget. But over 50000 are missing both. That makes me think there is a consistency in the missingness.
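The same overlap can be cross-checked as a plain table. This is a sketch using dplyr only; the names overlap, mpaa_missing and budget_missing are made up for this example.

```r
library(dplyr)
library(ggplot2movies)

overlap <- movies %>%
  # Treat blank MPAA ratings as missing, as we did above
  mutate(mpaa = na_if(mpaa, "")) %>%
  # Count every combination of missing and present values
  count(mpaa_missing = is.na(mpaa), budget_missing = is.na(budget))

overlap
```

Each row of the result corresponds to one combination in the upset plot, plus a row for the movies where nothing is missing.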

We can also use geom_miss_point() to see if there are more patterns between the missing variables. Let’s also use label_number() from the scales (Wickham, Pedersen, and Seidel 2023) package to prettify the x-axis labels.

library(ggplot2)
library(naniar)  # for geom_miss_point()
library(scales)

p <- movies_na %>%
  ggplot(aes(budget, mpaa)) +
  geom_point() +
  geom_miss_point() +
  scale_x_continuous(
    labels = label_number(
      scale  = 1e-6,
      prefix = "$",
      suffix = "M"
    )
  )

# Print the plot
p

Values seem to be missing in all the MPAA rating categories. NC-17 does not seem to contain that many values in general, but many of them seem to be missing.

Let’s confirm this observation by creating a frequency table with the tabyl() function. It’s from a neat package called janitor (Firke 2023).

library(janitor)

movies_na %>%
  filter(mpaa == "NC-17") %>%
  tabyl(mpaa, budget) %>%
  # paged_table() comes from the rmarkdown package
  rmarkdown::paged_table()

So, there are more NC-17 movies that are missing the budget than those that aren’t.
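If only the split between missing and known budgets matters, a one-way table on a helper column may be easier to read. This is a sketch; budget_missing and nc17_budget are names made up for this example.

```r
library(dplyr)
library(ggplot2movies)
library(janitor)

nc17_budget <- movies %>%
  filter(mpaa == "NC-17") %>%
  # TRUE when the budget is not available
  mutate(budget_missing = is.na(budget)) %>%
  tabyl(budget_missing)

nc17_budget
```

The n column holds the counts and the percent column their shares.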

We can dig even deeper. We can use facet_wrap() from ggplot2 to see how the missing values are distributed across the decades.

p +
  scale_x_continuous(
    n.breaks = 3,
    labels   = label_number(
      scale  = 1e-6,
      prefix = "$",
      suffix = "M"
    )
  ) +
  facet_wrap(vars(decade))

Before we move on to the next topic, let’s look at one interesting alternative, vis_miss(). The naniar package imports it from visdat (Tierney 2023).

movies_na %>%
  vis_miss(
    # TRUE arranges the columns in order of missingness. Default value is FALSE.
    sort_miss       = TRUE,
    # Warn if there is large data? Default is TRUE.
    warn_large_data = FALSE 
  )

This one packs a punch. You see the share of NAs overall and in each column. And to top it off, we can “arrange the columns in order of missingness”. A great quick overview!
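When exact numbers are needed instead of a picture, naniar also has miss_var_summary(). A minimal sketch, again rebuilding the NA-coded data under the illustrative name movies_check so that the block runs on its own:

```r
library(dplyr)
library(ggplot2movies)
library(naniar)

# Rebuild the NA-coded data so that this block is self-contained
movies_check <- movies %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

# One row per column, with the number and percentage of missing values
miss_summary <- miss_var_summary(movies_check)
head(miss_summary)
```

The columns with the most missing values come first, which mirrors the sort_miss = TRUE ordering used in the plot.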

Now that we know where the missing data points are, one next step could be handling that missing data. There are different strategies for that, but this is outside the scope of this book.

2.2.2 Summarizing data

2.2.3 Visualizing data

Pairwise plot matrix

Correlation matrix

2.2.4 Automated EDA app