How to create a boxplot using ggplot2 in R
A boxplot summarizes the distribution of a continuous variable for several categories. If categories are organized in groups and subgroups, it is possible to build a grouped boxplot.
Boxplots are built with the geom_boxplot() geom of ggplot2. See its basic usage in the first example below. Note that reordering groups is an important step toward a more insightful figure. Also, showing individual data points with jittering is a good way to avoid hiding the underlying distribution.
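The jittering idea mentioned above can be sketched as follows. This is a minimal example, assuming the mpg dataset that ships with ggplot2; the object name p_jitter and the jitter width, point size and transparency are arbitrary illustrative choices, not recommended values.

```r
library(ggplot2)

# Boxplot of highway mileage per car class, with raw observations on top.
# outlier.shape = NA hides the boxplot's own outlier symbols so that
# outliers are not drawn twice (once by the box, once by the jitter layer).
p_jitter <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(outlier.shape = NA, fill = "grey90") +
  # height = 0 keeps the y values exact; only the x position is jittered
  geom_jitter(width = 0.2, height = 0, size = 1, alpha = 0.5)

p_jitter
```

Jittering only along the x axis keeps every point at its true hwy value, so the spread inside each box can be read directly.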
ggplot2 boxplot parameters
This section describes the options you can pass to geom_boxplot() to customize the general chart appearance.
# Load ggplot2
library(ggplot2)
# The mpg dataset ships with ggplot2
# head(mpg)
# geom_boxplot() offers several arguments to customize appearance
ggplot(mpg, aes(x=class, y=hwy)) +
  geom_boxplot(
    # customize boxes
    color="blue",
    fill="blue",
    alpha=0.2,
    # draw a notch around the median
    notch=TRUE,
    notchwidth=0.8,
    # customize outliers
    outlier.colour="red",
    outlier.fill="red",
    outlier.size=3
  )
Reorder a variable with ggplot2
This section explains how to reorder the levels of a factor through several examples. Examples are based on 2 dummy datasets:
# Libraries
library(ggplot2)
library(dplyr)
# Dataset 1: one value per group
data <- data.frame(
  name=c("north","south","south-east","north-west","south-west","north-east","west","east"),
  val=sample(seq(1,10), 8)
)
# Dataset 2: several values per group (provided by ggplot2)
# mpg
Method 1: the forcats library
forcats is a tidyverse package built to handle factors in R. It provides a suite of tools that solve common problems with factors. The fct_reorder() function reorders a factor (data$name, for example) according to the values of another column (data$val here).
# Load the library
library(forcats)
# Reorder following the value of another column:
data %>%
  mutate(name = fct_reorder(name, val)) %>%
  ggplot(aes(x=name, y=val)) +
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
  coord_flip() +
  xlab("") +
  theme_bw()
# Reverse side
data %>%
  mutate(name = fct_reorder(name, desc(val))) %>%
  ggplot(aes(x=name, y=val)) +
  geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
  coord_flip() +
  xlab("") +
  theme_bw()
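# Using median
mpg %>%
  mutate(class = fct_reorder(class, hwy, .fun=median)) %>%
  # map x to the reordered factor itself; calling reorder(class, hwy) here
  # would re-sort the levels by mean and discard the median ordering
  ggplot(aes(x=class, y=hwy, fill=class)) +
  geom_boxplot() +
  theme(legend.position="none") +
  xlab("")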
# Using number of observations per group
mpg %>%
  mutate(class = fct_reorder(class, hwy, .fun=length)) %>%
  ggplot(aes(x=class, y=hwy, fill=class)) +
  geom_boxplot() +
  theme(legend.position="none") +
  xlab("")
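To see what fct_reorder() does without drawing a chart, it can help to inspect the factor levels directly. The sketch below uses a hypothetical three-group toy vector (name_toy, val_toy are illustrative names, not part of the datasets above) and the .desc argument of fct_reorder() as an alternative to wrapping the values in desc().

```r
library(forcats)

# Hypothetical toy data: three groups, one value each
name_toy <- factor(c("a", "b", "c"))
val_toy  <- c(3, 1, 2)

# Default level order is alphabetical
levels(name_toy)

# Levels sorted by increasing val_toy
asc <- fct_reorder(name_toy, val_toy)
levels(asc)

# Levels sorted by decreasing val_toy
dsc <- fct_reorder(name_toy, val_toy, .desc = TRUE)
levels(dsc)
```

Only the levels change; the values stored in the factor stay in their original positions, which is why the data frame itself does not need to be sorted.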
Method 2: using dplyr only
The mutate() function of dplyr creates a new variable or modifies an existing one. It can be used to recreate a factor with a specific order. Here are 2 examples:
- The first uses arrange() to sort the data frame, then rebuilds the factor following this order.
- The second specifies a custom order for the factor by giving the levels one by one.
data %>%
  arrange(val) %>%    # First sort by val. This sorts the data frame but NOT the factor levels
  mutate(name=factor(name, levels=name)) %>%    # This trick updates the factor levels
  ggplot(aes(x=name, y=val)) +
  geom_segment(aes(xend=name, yend=0)) +
  geom_point(size=4, color="orange") +
  coord_flip() +
  theme_bw() +
  xlab("")
data %>%
  arrange(val) %>%
  mutate(name = factor(name, levels=c("north","north-east","east","south-east","south","south-west","west","north-west"))) %>%
  ggplot(aes(x=name, y=val)) +
  geom_segment(aes(xend=name, yend=0)) +
  geom_point(size=4, color="orange") +
  theme_bw() +
  xlab("")
Method 3: the reorder() function of base R
If you prefer to stay with base R, here is how to control the order using the reorder() function inside a with() call:
# reorder() is close to order(), but is made to change the order of the factor levels.
mpg$class = with(mpg, reorder(class, hwy, median))
p <- mpg %>%
  ggplot(aes(x=class, y=hwy, fill=class)) +
  geom_violin() +
  theme(legend.position="none") +
  xlab("")
p
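The effect of reorder() can also be checked on a small vector before using it in a plot. This is a minimal base R sketch with hypothetical toy data (grp and score are illustrative names); negating the values is one common way to get a decreasing order.

```r
# Hypothetical toy data: two observations per group
grp   <- factor(c("x", "x", "y", "y", "z", "z"))
score <- c(10, 12, 1, 3, 5, 7)

# reorder() returns a factor whose levels are sorted by a summary of score;
# here the per-group medians are x = 11, y = 2, z = 6
grp_sorted <- reorder(grp, score, FUN = median)
levels(grp_sorted)

# Negating the values sorts the levels in decreasing order instead
grp_rev <- reorder(grp, -score, FUN = median)
levels(grp_rev)
```

When FUN is omitted, reorder() summarizes with the mean, so the median has to be requested explicitly when that is the intended ordering.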
Control ggplot2 boxplot colors
These four examples illustrate the most common color scales used in boxplots.
Note the use of RColorBrewer and viridis to automatically generate nice color palettes.
# library
library(ggplot2)
# The mpg dataset ships with ggplot2
# head(mpg)
# Top Left: Set a unique color with fill, colour, and alpha
ggplot(mpg, aes(x=class, y=hwy)) +
  geom_boxplot(color="red", fill="orange", alpha=0.2)
# Top Right: Set a different color for each group
ggplot(mpg, aes(x=class, y=hwy, fill=class)) +
  geom_boxplot(alpha=0.3) +
  theme(legend.position="none")
# Bottom Left: Use a sequential RColorBrewer palette
ggplot(mpg, aes(x=class, y=hwy, fill=class)) +
  geom_boxplot(alpha=0.3) +
  theme(legend.position="none") +
  scale_fill_brewer(palette="BuPu")
# Bottom Right: Use a qualitative RColorBrewer palette
ggplot(mpg, aes(x=class, y=hwy, fill=class)) +
  geom_boxplot(alpha=0.3) +
  theme(legend.position="none") +
  scale_fill_brewer(palette="Dark2")
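The viridis palette mentioned above can be applied in the same way. The sketch below assumes a ggplot2 version that provides scale_fill_viridis_d() (3.0.0 or later), so the separate viridis package does not need to be loaded; the object name p_viridis and the alpha value are illustrative choices.

```r
library(ggplot2)

# One fill color per class, taken from the discrete viridis palette.
# option = "D" is the default viridis color map.
p_viridis <- ggplot(mpg, aes(x = class, y = hwy, fill = class)) +
  geom_boxplot(alpha = 0.6) +
  scale_fill_viridis_d(option = "D") +
  theme(legend.position = "none")

p_viridis
```

The _d suffix selects the discrete version of the scale, which is the one needed when fill is mapped to a categorical variable such as class.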
Highlighting a group
To highlight a group, first create a new column with mutate() that stores the binary information: highlight or not. Then map this column to the fill aesthetic of ggplot2 and, if needed, customize the appearance of the highlighted group with scale_fill_manual() and scale_alpha_manual().
# Libraries
library(ggplot2)
library(dplyr)
library(hrbrthemes)    # provides theme_ipsum()
mpg %>%
  # Add a column called 'type': do we want to highlight the group or not?
  mutate(type=ifelse(class=="subcompact", "Highlighted", "Normal")) %>%
  # Build the boxplot. Map this column to the 'fill' and 'alpha' aesthetics
  ggplot(aes(x=class, y=hwy, fill=type, alpha=type)) +
  geom_boxplot() +
  scale_fill_manual(values=c("#69b3a2", "grey")) +
  scale_alpha_manual(values=c(1, 0.1)) +
  theme_ipsum() +
  theme(legend.position="none") +
  xlab("")
Grouped boxplot with ggplot2
A grouped boxplot is a boxplot where categories are organized in groups and subgroups.
Here we visualize the distribution of 7 groups (called A to G) and 2 subgroups (called low and high). Note that the group must be mapped to the x aesthetic of ggplot2. The subgroup is mapped to the fill aesthetic.
# library
library(ggplot2)
# create a data frame
variety <- rep(LETTERS[1:7], each=40)
treatment <- rep(c("high","low"), each=20)    # recycled by data.frame() to fill the 280 rows
note <- 1:280 + sample(1:150, 280, replace=TRUE)
data <- data.frame(variety, treatment, note)
# grouped boxplot
ggplot(data, aes(x=variety, y=note, fill=treatment)) +
  geom_boxplot()
Using small multiples
An alternative to the grouped boxplot is faceting: each subgroup (first example) or each group (second example) is represented in a distinct panel.
# One panel per treatment
p1 <- ggplot(data, aes(x=variety, y=note, fill=treatment)) +
  geom_boxplot() +
  facet_wrap(~treatment)
p1
# One panel per variety, each with its own axis ranges
p2 <- ggplot(data, aes(x=variety, y=note, fill=treatment)) +
  geom_boxplot() +
  facet_wrap(~variety, scales="free")
p2
Boxplot with variable width
Boxplots are often criticized for hiding the underlying distribution of each category. Since individual data points are hidden, it is also impossible to know what sample size is available for each category.
In this example, box widths are proportional to the square root of the sample size thanks to the varwidth option. On top of that, the exact sample size is added to the X axis labels for more accuracy.
# library
library(ggplot2)
# create data
names <- c(rep("A", 20), rep("B", 5), rep("C", 30), rep("D", 100))
value <- c(sample(2:5, 20, replace=TRUE), sample(4:10, 5, replace=TRUE), sample(1:7, 30, replace=TRUE), sample(3:8, 100, replace=TRUE))
# store names as a factor so that levels() returns the group names
# (since R 4.0, data.frame() no longer converts strings to factors)
data <- data.frame(names=factor(names), value)
# prepare a special xlab with the number of obs for each group
my_xlab <- paste(levels(data$names), "\n(N=", table(data$names), ")", sep="")
# plot
ggplot(data, aes(x=names, y=value, fill=names)) +
  geom_boxplot(varwidth=TRUE, alpha=0.2) +
  theme(legend.position="none") +
  scale_x_discrete(labels=my_xlab)
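The label-building step can be checked on its own before it is passed to scale_x_discrete(). This is a minimal base R sketch with hypothetical group names and sizes (grp_toy, counts and axis_labels are illustrative names). It reads the group names from table() rather than levels(), which also works when the grouping column is a character vector rather than a factor.

```r
# Hypothetical toy grouping variable, stored as a plain character vector
grp_toy <- c(rep("A", 20), rep("B", 5), rep("C", 30))

# table() counts the observations per group, in sorted group order
counts <- table(grp_toy)

# Build one label per group: the group name, a line break, then the size
axis_labels <- paste0(names(counts), "\n(N=", as.vector(counts), ")")

# Naming the vector lets a discrete scale match labels to groups by name
names(axis_labels) <- names(counts)
axis_labels
```

A named vector built this way can be given to the labels argument of scale_x_discrete(), so each label is attached to its group by name rather than by position.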