Data visualization is the step in the data analysis process that, if done right, “[…] gives you answers to questions you didn’t know you had” (Ben Shneiderman). Besides helping us understand trends and patterns, a visualization provides the visual language for communicating the insights generated through the analysis, which is probably even more important.
Data analysts now have various tools in their arsenal for creating stunning visualizations, including numerous R packages (collections of R functions, compiled code, and sample data) developed specifically for different kinds of visualizations. Some of these packages even have a life of their own beyond data analytics. For instance, in recent years, the aesthetics of data visualization inspired the creation of packages for generative art, such as flametree (Figure 1).
Figure 1 Study of still life 68 (Mahir Hrnjic)
In this Tech Bite, we will go through some of the most common packages in R used for data visualization. The aim is to highlight the pros and cons of each package, as well as the general or specific use for which it was designed. But before we jump into the different data visualization packages, it is worth mentioning that R already has built-in functionality for plotting, referred to as base graphics. Packages extend what base graphics offer and are usually built around a specific plotting framework. Despite the large number of packages designed with data visualization in mind, base graphics are still preferred by many R users.
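To give a feel for base graphics, here is a minimal sketch. It assumes the built-in mtcars dataset purely for illustration; the variables, labels, and colors are our own choices and are not taken from the analysis in this article.

```r
# Base graphics need no extra packages: plot() draws the scatter plot
plot(mtcars$wt, mtcars$mpg,
     main = "Fuel efficiency vs. weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     pch = 19, col = "steelblue")

# Fit a linear model and overlay its trend line on the existing plot
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red", lwd = 2)
```

Note that every call draws directly onto the current device, so the plot is built up by side effects rather than stored as an object.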
Whoever has set foot in the R ecosystem has probably heard of and worked with the ggplot2 package. It is considered the standard data visualization package, to the degree that base graphics are often seen as just a stepping stone to ggplot2. The package is based on the Grammar of Graphics, a plotting framework which holds that graphics, like language, should be built from basic grammatical elements. For instance, a simple sentence consists of subject + verb + object, while a graph is built from layers of data + aesthetics + geometry (Figure 2). The package has been designed around this theory, and plotting in ggplot2 means adding further layers (such as statistical summaries, metadata, and annotations) on top of these. The syntax is user-friendly and intuitive (even though slightly different from the rest of R), more so than that of base graphics or other visualization packages. Furthermore, it ‘pushes’ its users to apply tidy data principles to their datasets and is well integrated into the tidyverse, a collection of packages created with data science in mind.
library(ggrepel)
library(ggplot2)
library(ggthemes)
library(extrafont)
library(grid)
library(cowplot)

# select data and variables
g1 <- ggplot(inq, aes(as.numeric(VALUE), as.numeric(EDUCATION)))

# create scatter plot
g2 <- g1 +
  geom_point(aes(color = REGION), shape = 21, fill = "white", size = 3, stroke = 1.5)

# adding trend line (coefficient of determination)
g3 <- g2 +
  geom_smooth(aes(fill = "red"), method = "lm", formula = y ~ x,
              se = FALSE, linetype = 1, color = "red")

# adding distance to the point labels
g4 <- g3 +
  geom_text_repel(data = inq, aes(label = COUNTRY), size = 3,
                  box.padding = unit(1.2, "lines"))

# adding color values to the points and trend line parameters
g5 <- g4 +
  scale_color_manual(values = c("#F55840", "#924F3E", "#29B00E",
                                "#23576E", "#099FDB", "#208F84")) +
  scale_fill_manual(name = "My Lines", values = c("red"), labels = c("R^2=23%"))

# positioning and editing legend
g6 <- g5 +
  theme(legend.position = "top",
        legend.title = element_blank(),
        legend.box = "horizontal",
        legend.text = element_text(size = 8.5)) +
  guides(col = guide_legend(nrow = 1))

# editing grid parameters
g7 <- g6 +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_line(color = "gray50", size = 0.5),
        panel.grid.major.x = element_blank(),
        panel.background = element_blank(),
        line = element_blank())

# naming and editing plot axes
g8 <- g7 +
  scale_x_continuous(expand = c(0.02, 0.02), n.breaks = 8,
                     name = "Inequality index (0 = complete equality, 1 = complete inequality)") +
  scale_y_continuous(expand = c(0.02, 0.02), limits = c(30, 100), n.breaks = 10,
                     name = "Educational attainment (% of upper secondary degrees)")
g9 <- g8 +
  theme(axis.ticks.length = unit(.15, "cm"),
        axis.ticks.y = element_blank(),
        axis.title.x = element_text(color = "black", size = 10, face = "italic"),
        axis.title.y = element_text(color = "black", size = 10, face = "italic"))

# adding plot title and source information
g10 <- g9 +
  ggtitle("Inequality and education development, 2018 \n") +
  theme(plot.title = element_text(hjust = 0.001, vjust = 2.12, colour = "black",
                                  size = 14, face = "bold"))
g11 <- add_sub(g10, "Source: OECD Statistics, Better Life Index",
               x = 0.001, hjust = 0, fontface = "plain", size = 10)

# create plot
ggdraw(g11)
Figure 2 An example of a ggplot2 scatterplot
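To make the data + aesthetics + geometry idea concrete, here is a stripped-down sketch of the grammar. It is not the code behind the figure: it assumes the built-in mtcars dataset, and the variable choices and labels are ours, for illustration only.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                                  # data
            aes(x = wt, y = mpg, color = factor(cyl))) +    # aesthetics
  geom_point(size = 3) +                                    # geometry
  # optional extra layer: one linear trend line across all groups
  geom_smooth(aes(group = 1), method = "lm", formula = y ~ x,
              se = FALSE, color = "red") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

# The plot is an object; it is only drawn when printed
print(p)
```

Because each `+` returns a new plot object, intermediate steps can be stored and reused, which is the pattern the longer example follows with g1 to g11.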
Lattice is a high-level data visualization system that implements Trellis graphics in R. Trellis graphics were originally developed for S and S-PLUS to provide a convenient way of displaying multiple panels on a single page (also known as facet plots), i.e., conditioning plots (Figure 3). Lattice’s strong side is therefore the visualization of multivariate data. Lattice is highly customizable, but many graph elements, such as margins, text, and spacing, are adjusted automatically. Compared to ggplot2, lattice is faster (roughly 4 to 5 times) and is better suited for larger datasets where many variables must be plotted simultaneously. On the other hand, it can be less intuitive than ggplot2, since all plot parameters are passed in a single function call rather than added layer by layer. If you are working with a larger dataset and want to explore the relations across multiple variables, lattice is a great choice!
Figure 3 Trellis graphics used with a biplot and a 3D scatterplot
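A conditioning plot in lattice can be sketched in a single call. The example below assumes the built-in mtcars dataset and our own choice of variables; it is an illustration of the formula interface, not the code behind the figure.

```r
library(lattice)

# y ~ x | g reads as "plot y against x, conditioned on g":
# one panel is drawn for each level of the conditioning variable
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),        # three panels in one row
            type = c("p", "r"),      # points plus a regression line per panel
            xlab = "Weight (1000 lbs)",
            ylab = "Miles per gallon",
            main = "Fuel efficiency by number of cylinders")

# lattice functions return a trellis object that is drawn when printed
print(p)
```

Everything, from panel layout to labels, is passed as an argument to one function, which is what makes lattice compact but less incremental than ggplot2.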
Esquisse is a package built on top of ggplot2 that provides interactive drag-and-drop visualizations. The package creates a Tableau-like interface for plotting in ggplot2 (Figure 4), allowing the user to go through the data quickly and without writing code. Users can draw bar plots, curves, scatter plots, histograms, and box plots, and then export the graph or retrieve the code to make additional changes. Esquisse comes in handy in exploratory data analysis, providing a quick and easy way for an analyst to go through the data. Users can then copy the code and adjust or add parameters in ggplot2 to create more refined plots. The package definitely saves time when different univariate and bivariate methods are being explored on different variables in a dataset. However, it still requires knowledge of ggplot2 syntax if the user intends to produce presentation or publication-quality graphs.
Figure 4 The GUI that opens when esquisse is launched in R
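Launching the interface takes one call to esquisser(). The sketch below wraps that call in a small helper of our own (launch_esquisse is a hypothetical name, not part of the package) so that the script does nothing harmful when run non-interactively or without esquisse installed. The built-in iris dataset is an assumption for illustration.

```r
# Hypothetical helper: open the esquisse GUI only when it can actually run
launch_esquisse <- function(data) {
  if (interactive() && requireNamespace("esquisse", quietly = TRUE)) {
    # Opens the drag-and-drop interface on the supplied data frame
    esquisse::esquisser(data)
    return(invisible(TRUE))
  }
  message("esquisse needs an interactive R session and the esquisse package.")
  invisible(FALSE)
}

launch_esquisse(iris)
```

In an interactive session, the ggplot2 code generated by the interface can then be copied into a script and refined by hand.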
Plotly is the second most popular visualization package in R after ggplot2. Whereas ggplot2 is used for static plots, plotly is used for creating dynamic plots. Like ggplot2, it offers a plethora of chart types for visualizing our data. Unlike ggplot2, however, it is primarily used for producing interactive and 3D web-based plots (Figure 5), which ggplot2 lacks out of the box (although recent extensions allow interactive plots to be built from ggplot2). Users can export their plots as .html files and embed them on a website. One downside of the plotly package in R is the limited documentation available compared to other, more popular packages. Furthermore, producing a clear and well-annotated 3D graph can require substantial tinkering for newcomers. Finding an adequate solution for your visualization problem might take some time.
Figure 5 An example of a 3D scatterplot created with plotly in R
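As a rough sketch of the plotly workflow, the code below builds a 3D scatter plot, saves it as an .html file, and converts an existing ggplot2 graphic into an interactive one with ggplotly(). The mtcars dataset, the variable choices, and the output file name are our assumptions for illustration; saving relies on the htmlwidgets package, which plotly depends on.

```r
library(ggplot2)
library(plotly)

# Interactive 3D scatter plot; ~ marks columns of the supplied data frame
p3d <- plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec,
               color = ~factor(cyl),
               type = "scatter3d", mode = "markers")

# Export to HTML so the plot can be embedded on a website
# (selfcontained = FALSE writes the JavaScript dependencies to a side folder)
out_file <- file.path(tempdir(), "scatter3d.html")
htmlwidgets::saveWidget(p3d, out_file, selfcontained = FALSE)

# Turn a static ggplot2 graphic into an interactive plotly one
gg <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2d <- ggplotly(gg)
```

Printing p3d or p2d in an interactive session opens the plot in the viewer, where it can be rotated, zoomed, and hovered over.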
There are currently thousands of R packages, each created with a certain aim, whether to solve one specific laborious issue or to implement a whole new framework for approaching certain steps in the data analysis process. Because of the sheer number of packages, users can become creatures of habit and stick to one package even when others are designed to make the task easier. As we saw, we can save quite a bit of time on exploratory analysis by using esquisse before finalizing a graphic in ggplot2. If we aim to visualize multivariate data and to declutter the plot by splitting it into multiple panels, then lattice offers a faster and simpler way to do this (despite its plots not being as pretty as ggplot2 plots). The functional gap between ggplot2 and plotly is narrowing, since new extensions are being created that allow users to create dynamic plots through ggplot2. Nevertheless, plotly is still more convenient and better looking if we aim to produce a dynamic web-based plot.
Learning the advantages and limitations of the different plotting systems in R will hopefully save users time and help them build an effective analytical framework.
“Comparing different plotting systems in R” Tech Bite was brought to you by Mahir Hrnjić, Data Analyst at Atlantbh.
Tech Bites are tips, tricks, snippets or explanations about various programming technologies and paradigms, which can help engineers with their everyday job.