Introduction
Everything is liable to change, even the title
Data visualization is one of those things that can look easy when you see someone doing it well. Then you try it yourself and it’s easy to feel like you know nothing.
I wanted to share the tools that make it easier to create those beautiful visualizations. And to show you how to use those tools together, not only with ggplot2 (Wickham 2024) but with each other.
One bit about terminology before we go any further. Chart, diagram, graph, plot, and visualization are all terms that people have tried to give strict definitions to. But in reality, they get used more or less interchangeably. Is it a bar chart or a bar graph? Yes.
In the context of this book, I try to use the terms consistently in this way:
- Chart / Diagram / Graph: Type of visualization (e.g. bar chart, Venn diagram, or line graph)
- Plot: Output of some code to create a visualization. It can contain many charts, diagrams, and/or graphs. But sometimes it’s also used to mean a type of visualization (e.g. scatter plot)
- Visualization: The most general of the terms. It can be used to talk about visualization as a concept. But sometimes it’s also used for a more specific case.
Purpose and scope
A very basic reason why I started writing this book was the fact that one didn’t already exist. “Someone had to do it”, as they say.
I know there is information out there about the different ggplot2 extension packages. There is the package documentation, and there are also some blogs around (see Further resources).
But that information is quite scattered. There were many great packages I hadn’t even heard about before I started doing research for the book.
I’ve tried my best to test and write about the most useful among them. After all, there are hundreds of packages to choose from. And I don’t think everyone should have to go through them all. Don’t try this at home!
What you will learn
This book continues from where ggplot2: Elegant Graphics for Data Analysis (Wickham 2016) left off. In his book, Hadley does mention some extension packages by name. But until now there hasn’t been a comprehensive guide to those extension packages. And how to use them together.
Basics works as kind of a “Previously, on ggplot2…”. A reminder of the basic function(alitie)s of the core package.
Layers goes through the layers that make up the grammar of graphics: Data, geometric objects (Geoms), statistical transformations (Stats), titles and labels (Annotations), color, position and size of visual elements (Scales and guides), Coordinate systems, small multiples (Facets), and Themes. They are in a logical order. As are the rest of the chapters. The same order that you would build your visualization in.
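To make that order concrete, here is a minimal sketch that adds one piece per layer, in the same sequence as the list above. The mpg dataset ships with ggplot2; the choice of variables, labels, and theme is my own assumption for illustration, not an example taken from the later chapters.

```r
library(ggplot2)

# Each line adds one layer of the grammar, in the order listed above
p <- ggplot(mpg, aes(x = displ, y = hwy)) +                   # Data
  geom_point(aes(color = class)) +                            # Geoms
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # Stats
  labs(
    title = "Bigger engines, fewer miles per gallon",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  ) +                                                         # Annotations
  scale_color_viridis_d(name = "Vehicle class") +             # Scales and guides
  coord_cartesian(ylim = c(10, 45)) +                         # Coordinate systems
  facet_wrap(vars(drv)) +                                     # Facets
  theme_minimal()                                             # Themes

p
```

Nothing forces you to write the layers in this order, since ggplot2 accepts most of them in any sequence. Keeping the order makes the code easier to read top to bottom.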
Special cases hosts not-so-typical visualizations: Maps, networks, interactive plots, animated plots, and graphs integrated in tables.
Shortcuts contains different kinds of helpers and interactive tools. Designed to make it easier to create your visualizations.
At some point, we will be getting ready to publish our visualizations. You’ll want to give them finishing touches, but it’s also good to know about arranging plots.
And while ggplot2 is an R package, its impact and influence go beyond R. So we’ll talk about Python a bit.
To finish off, we go through some case studies. We build visualizations, step-by-step, from data to publication.
What you won’t learn
These are some of the interesting topics that I had to leave out of the book. It’s long enough as it is and I decided to concentrate on the topics and extensions that everyone can use. Regardless of the field you’re in.
Combining visualizations and BI tools
Using R in general and ggplot2 visualizations in particular within BI tools like Power BI is cool. Interested? You can start by reading Luca Zavarella’s book Extending Power BI with Python and R.
Generative art
Creating art with R - color me impressed! It’s called Rtistry or aRt, because why not? There isn’t enough room in the book for a proper chapter on this, but here are some links to get you started on the topic:
Large language models (LLMs)
I know, it would be trendy to write about AI. And I do think they can play a part in the data visualization process. But there isn’t anything that meaningful to write about. Whatever you do, whether you use AI or not, do trust your senses!
A machine doesn’t need to visualize data to use it. So, in a way, data visualization is one of the last bastions of human-to-human communication. By humans, for humans, am I right?
Specialized packages in specialized fields like bioinformatics
This one should be self-evident. “Write what you know”. I know nothing about bioinformatics. I’m not even sure I understand what the term means.
Are you interested in bioinformatics? I mean, I don’t want to leave you hanging. There is a website and a suite of packages called Bioconductor that should get you started!
Sports
It’s the same as with generative art. Interesting topic, but not enough space. Still, here’s another short list of packages related to sports:
Baseball
Football
Soccer
Various
Who is the book for?
I assume that you’re reading this because you’re interested in data visualization, R, or both.
I already wrote about this in the Welcome section. This book is for people who already have a working knowledge of ggplot2 and the grammar of graphics. I’m quite adamant about that. You should learn the foundations first. And there are already excellent books and websites for that.
Other than that, the tools and techniques in the book aren’t all that advanced (at least most of them). You don’t have to be a visualization wizard to find the book useful. Each of the chapters will contain code that you can use almost verbatim.
It doesn’t even matter if R isn’t your native language. You might find an inspiring R package and then go on and find a similar package in your preferred language. Or write that package, if one doesn’t already exist. It’s all good!
Who am I to write this book?
Hi! I’m Antti Rask, a data analyst if you had to choose one label. I’ve lived many lives and had many careers before finding my truest calling in data.
I’ve liked data visualizations longer than I’ve been using R (which I started in late 2019/early 2020). I made my first visualizations with Excel. At some point, I switched to Power BI. Which is still my main tool for visualization at work.
You could say that I am an artistic type. Music has been the main avenue for my artistic self-expression for most of my life. But I’ve also liked to take photos of street art. I come across it on my adventures in Helsinki, or when I’m abroad.
In visualizing data I find myself leaning more towards the Stephen Few school of thought. Less is usually more and you should avoid chartjunk at all costs. Still, I’m not a purist, by any means.
I’ve co-authored the RandomWalker (Sanderson and Rask 2024) R package with Steven Paul Sanderson II (@spsanderson). Specifically the function RandomWalker::visualize_walks(). It uses ggplot2 and some of the extensions found in this book.
I also like to tinker with R Shiny and recently made an app called TuneTeller. It received an honorable mention in the 2024 Shiny Contest.
R
All the code in this book is in R, except for Chapter 18, Python. If you have never used R, R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023) is a better place to start.
Don’t have R yet? You can get the latest version from CRAN (the Comprehensive R Archive Network): https://www.r-project.org/.
IDE
RStudio is my integrated development environment (IDE) of choice. It’s free and open-source. It’s the environment where you can try out the code you find in this book. And it’s also where I wrote this book.
You can download the latest version from the Posit website: https://posit.co/download/rstudio-desktop/. Posit is the company that develops RStudio and many R packages. Like the tidyverse (Wickham et al. 2023) (see ggplot2 & tidyverse).
There are alternatives like VS Code, Positron, and JupyterLab. This book does operate under the assumption that you are using RStudio.
Packages
ggplot2 & tidyverse
The tidyverse is a collection of R packages. ggplot2 is one of its core packages. The other core packages are dplyr, forcats, lubridate, purrr, readr, stringr, tibble, and tidyr.
You could install the packages you need one by one, but you might as well install the whole tidyverse with install.packages("tidyverse").
To highlight each of the packages used, I’ve decided to load the individual packages in the code examples. If you want to, you can load them all at once with library(tidyverse), which prints the following:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The previous message mentions conflicted (Wickham 2023). It’s not part of the core tidyverse, but it’s one of my favorite packages. And I do recommend loading it whenever you’re writing new code.
It checks for conflicting function names and forces you to choose the one you want to use. Either use case by use case (for example dplyr::filter()) or session-wide (for example conflicts_prefer(dplyr::filter())).
To install conflicted:
install.packages("conflicted")
All you need to do is load the package at the beginning of your script with library(conflicted).
It will then let you know if a conflict emerges. One fewer reason to have your code act weird when running it.
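Here is a minimal sketch of both ways to resolve a conflict, assuming conflicted and dplyr are installed. The mtcars dataset and the variable names are my own choices for illustration.

```r
library(conflicted)
library(dplyr)

# Use case by use case: name the package explicitly in the call
four_cyl <- dplyr::filter(mtcars, cyl == 4)

# Session-wide: declare the preference once, then call filter() unprefixed
conflicts_prefer(dplyr::filter)
six_cyl <- filter(mtcars, cyl == 6)

nrow(four_cyl)  # 11
nrow(six_cyl)   # 7
```

Without the conflicts_prefer() line, the unprefixed filter() call stops with an error that lists the candidates instead of silently picking one.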
ggplot2 extensions
Here are the main characters of this book.
Some of them you can install from CRAN:
install.packages(
  c(
    "DataExplorer",
    "GGally",
    "ggcorrplot",
    "ggplot2movies",
    "gt",
    "gtExtras",
    "naniar"
  )
)
Others you will have to install from GitHub:
Others
Here are the packages that are not part of the tidyverse. They aren’t considered to be ggplot2 extensions either. They are still useful for what we want to do and worth highlighting instead of hiding.
install.packages(
  c(
    "janitor",
    "scales"
  )
)
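Before running install calls like the ones above, it can help to check which packages you already have. This is a minimal sketch using base R’s requireNamespace(). The helper name missing_packages() is hypothetical, something I made up for this example and not part of any package.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

to_install <- missing_packages(c("janitor", "scales"))

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
}
```

The install line is commented out on purpose, so that running the sketch only reports what is missing and never changes your library.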
Next steps?
If you are reading these chapters in order, the next step will be to read the rest of the book.
Of course, I recommend you don’t only read the book. I’ve learned the most from reading these technical books when I’ve also gone through the code examples. Debugged them (of course I hope the code I provide isn’t broken…) when needed. Even translated them to another language (Python to R) or dialect (base R to tidyverse).
Next, it’s practice, practice, practice. Recreate your favorite data visualizations using only ggplot2 and its extensions. Create something completely new. Either using your data (fitness, music, social media, etc.) or even something from Chapter 2, Data. There are also concepts like TidyTuesday (see Further resources).
Further resources
Blogs
Books
- BBC Visual and Data Journalism cookbook for R graphics
- Data visualisation using R, for researchers who don’t use R by Emily Nordmann et al.
- ggplot2: Elegant Graphics for Data Analysis (3e) by Hadley Wickham et al.
- An Introduction to ggplot2 by Ozancan Ozdemir
- Modern Data Visualization with R by Robert Kabacoff
- R Gallery Book by Kyle W. Brown
- R Graphics Cookbook, 2nd edition by Winston Chang
- Solutions to ggplot2: Elegant Graphics for Data Analysis by Howard Baek
- The Missing Book by Nicholas Tierney & Allison Horst
GitHub Repos
Websites
YouTube
Acknowledgments
Eevamaija Virtanen, Juha Korpela, and Säde Haveri for founding Helsinki Data Week with me. For helping me dream big! And for making me realize a big project like this is possible when you keep making constant progress.
Marc Eixarch (@Marceix) and Vicent Boned Riera (@eivicent) from the R User Group Finland. Learning R has been more fun thanks to your efforts in keeping the local R scene active in Helsinki, Finland.
Joe Reis, Ole Olesen-Bagneux, and Vin Vashishta for your generosity and time. And for setting a positive example for someone like me to follow.
Hadley Wickham (@hadley) for all the books and packages. R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023) got me started with R. ggplot2: Elegant Graphics for Data Analysis (Wickham 2016) helped me understand ggplot2 and the grammar of graphics on a deeper level.
To be continued…
Colophon
This book was written in RStudio using Quarto. The website is hosted with Netlify. And it’s automatically updated after every commit by GitHub Actions. The complete source is available from GitHub.
This version of the book was built with R version 4.4.2 (2024-10-31) and the following packages: