How to create a boxplot using ggplot2 in R

 

A boxplot summarizes the distribution of a continuous variable for several categories. If categories are organized in groups and subgroups, it is possible to build a grouped boxplot.

Boxplots are built with the geom_boxplot() geom of ggplot2. See its basic usage in the first example below. Note that reordering groups is an important step to get a more insightful figure. Also, showing individual data points with jittering is a good way to avoid hiding the underlying distribution.

ggplot2 boxplot parameters

This section describes the options you can pass to the geom_boxplot() function to customize the general chart appearance.

# Load ggplot2
library(ggplot2)

# The mpg dataset ships with ggplot2
# head(mpg)

# geom_boxplot() offers several arguments to customize appearance
ggplot(mpg, aes(x=class, y=hwy)) +
    geom_boxplot(
        # custom boxes
        color="blue",
        fill="blue",
        alpha=0.2,

        # notch?
        notch=TRUE,
        notchwidth=0.8,

        # custom outliers
        outlier.colour="red",
        outlier.fill="red",
        outlier.size=3
    )
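
The introduction mentions showing individual data points with jittering. A minimal sketch of that idea on the same mpg data follows; the jitter width, transparency, and point size are illustrative choices, not values from the original post.

```r
library(ggplot2)

# Hide the boxplot's own outlier markers so they are not drawn twice,
# then overlay every observation with a small horizontal jitter.
p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "blue", alpha = 0.2, outlier.shape = NA) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.4, size = 1)
p
```

Setting height = 0 keeps the jitter horizontal only, so the plotted hwy values stay exact.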


Reorder a variable with ggplot2

Reordering groups in a ggplot2 chart can be a struggle. This is because ggplot2 follows the order of the factor levels (or alphabetical order for a character column), not the order of the rows in your data frame. Sorting your input data frame with sort() or arrange() therefore has no impact on your ggplot2 output.
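
To make this concrete, here is a small sketch on a toy data frame (hypothetical values, not one of the datasets used in this post). Sorting the rows leaves the factor levels untouched:

```r
# Toy data frame (hypothetical values) with a factor column
df <- data.frame(name = factor(c("b", "c", "a")), val = c(3, 1, 2))

# Sorting the rows changes the row order...
sorted <- df[order(df$val), ]
as.character(sorted$name)   # "c" "a" "b"

# ...but the factor levels, which ggplot2 uses for the axis, are unchanged
levels(sorted$name)         # "a" "b" "c"
```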

This section explains how to reorder the levels of your factor through several examples. Examples are based on 2 dummy datasets:

# Library
library(ggplot2)
library(dplyr)

# Dataset 1: one value per group
data <- data.frame(
  name=c("north","south","south-east","north-west","south-west","north-east","west","east"),
  val=sample(seq(1,10), 8)
)

# Dataset 2: several values per group (provided by ggplot2)
# mpg

 

Method 1: the forcats library

The forcats library is a tidyverse package made especially to handle factors in R. It provides a suite of useful tools that solve common problems with factors. The fct_reorder() function lets you reorder a factor (data$name for example) following the values of another column (data$val here).

# load the library
library(forcats)
 # Reorder following the value of another column:
data %>%
  mutate(name = fct_reorder(name, val)) %>%
  ggplot( aes(x=name, y=val)) +
    geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
    coord_flip() +
    xlab("") +
    theme_bw()
 # Reverse side
data %>%
  mutate(name = fct_reorder(name, desc(val))) %>%
  ggplot( aes(x=name, y=val)) +
    geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
    coord_flip() +
    xlab("") +
    theme_bw()

If you have several values per level of your factor, you can specify which function to apply to determine the order. The default is to use the median, but you can also order the levels by another summary, such as the number of data points per group:

# Using median
mpg %>%
  mutate(class = fct_reorder(class, hwy, .fun=median)) %>%
  ggplot( aes(x=class, y=hwy, fill=class)) +
    geom_boxplot() +
    theme(legend.position="none") +
    xlab("")

# Using number of observations per group
mpg %>%
  mutate(class = fct_reorder(class, hwy, .fun=length)) %>%
  ggplot( aes(x=class, y=hwy, fill=class)) +
    geom_boxplot() +
    theme(legend.position="none") +
    xlab("")
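
To check what fct_reorder() did before plotting, you can inspect the resulting levels directly. The sketch below works on standalone vectors rather than inside mutate(); the variable names class_by_median and class_desc are made up for illustration. It also uses the .desc argument of fct_reorder() as an alternative to wrapping the values in desc().

```r
library(ggplot2)  # only for the mpg dataset
library(forcats)

# Reorder class by median hwy without touching the original data frame
class_by_median <- fct_reorder(mpg$class, mpg$hwy, .fun = median)

# Levels are now sorted by increasing median highway mileage
levels(class_by_median)
tapply(mpg$hwy, class_by_median, median)

# .desc = TRUE sorts the levels by decreasing median instead
class_desc <- fct_reorder(mpg$class, mpg$hwy, .fun = median, .desc = TRUE)
levels(class_desc)
```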

 

Method 2: using dplyr only

The mutate() function of dplyr lets you create a new variable or modify an existing one. You can use it to recreate a factor with a specific order. Here are 2 examples:

  • The first uses arrange() to sort your data frame, then rebuilds the factor so that its levels follow this order.
  • The second specifies a custom order for the factor by giving the levels one by one.
data %>%
  arrange(val) %>%    # First sort by val. This sorts the data frame but NOT the factor levels
  mutate(name=factor(name, levels=name)) %>%   # This trick updates the factor levels
  ggplot( aes(x=name, y=val)) +
    geom_segment( aes(xend=name, yend=0)) +
    geom_point( size=4, color="orange") +
    coord_flip() +
    theme_bw() +
    xlab("")
 
data %>%
  arrange(val) %>%
  mutate(name = factor(name, levels=c("north", "north-east", "east", "south-east", "south", "south-west", "west", "north-west"))) %>%
  ggplot( aes(x=name, y=val)) +
    geom_segment( aes(xend=name, yend=0)) +
    geom_point( size=4, color="orange") +
    theme_bw() +
    xlab("")

 

Method 3: the reorder() function of base R

If you prefer to stick to base R, here is how to control the order using the reorder() function inside a with() call:

# reorder() is close to order(), but it is made to change the order of the factor levels.
# Note: the assignment creates a modified copy of mpg in your workspace.
mpg$class <- with(mpg, reorder(class, hwy, median))

p <- mpg %>%
  ggplot( aes(x=class, y=hwy, fill=class)) +
    geom_violin() +
    theme(legend.position="none") +
    xlab("")
p
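
As a rough illustration of what reorder() returns, the sketch below applies it to standalone vectors (the names cls and cls_desc are hypothetical) and inspects the result. Negating the values is one simple way to get a decreasing order.

```r
library(ggplot2)  # only for the mpg dataset

# reorder() returns a factor whose levels are sorted by the summary statistic
cls <- reorder(mpg$class, mpg$hwy, FUN = median)
levels(cls)

# The per-level statistic is kept in the "scores" attribute
attr(cls, "scores")

# Negating the values sorts the levels in decreasing order instead
cls_desc <- reorder(mpg$class, -mpg$hwy, FUN = median)
levels(cls_desc)
```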

 

Control ggplot2 boxplot colors

These four examples illustrate the most common color scales used in boxplots.

Note the use of RColorBrewer palettes, through scale_fill_brewer(), to generate a color palette automatically; the viridis scales are a similar option.

# library
library(ggplot2)

# The mpg dataset ships with ggplot2
# head(mpg)

# Top Left: Set a unique color with fill, colour, and alpha
ggplot(mpg, aes(x=class, y=hwy)) +
    geom_boxplot(color="red", fill="orange", alpha=0.2)

# Top Right: Set a different color for each group
ggplot(mpg, aes(x=class, y=hwy, fill=class)) +
    geom_boxplot(alpha=0.3) +
    theme(legend.position="none")

# Bottom Left: RColorBrewer sequential palette
ggplot(mpg, aes(x=class, y=hwy, fill=class)) +
    geom_boxplot(alpha=0.3) +
    theme(legend.position="none") +
    scale_fill_brewer(palette="BuPu")

# Bottom Right: RColorBrewer qualitative palette
ggplot(mpg, aes(x=class, y=hwy, fill=class)) +
    geom_boxplot(alpha=0.3) +
    theme(legend.position="none") +
    scale_fill_brewer(palette="Dark2")
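
None of the four examples uses a viridis palette. A minimal sketch of that variant follows, assuming the discrete viridis scale bundled with ggplot2, scale_fill_viridis_d(), is what you want; the separate viridis package offers comparable scales.

```r
library(ggplot2)

# Same chart as above, with the discrete viridis scale provided by ggplot2
p <- ggplot(mpg, aes(x = class, y = hwy, fill = class)) +
  geom_boxplot(alpha = 0.3) +
  theme(legend.position = "none") +
  scale_fill_viridis_d()
p
```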

 

Highlighting a group

To highlight a group, first create a new column with mutate() that stores the binary information: highlight or not. Then map this column to the fill (and alpha) aesthetic and, optionally, customize the appearance of the highlighted group with scale_fill_manual() and scale_alpha_manual().

# Libraries
library(ggplot2)
library(dplyr)
library(hrbrthemes)

# Work with the mpg dataset provided by ggplot2
mpg %>%
  # Add a column called 'type': do we want to highlight the group or not?
  mutate( type=ifelse(class=="subcompact","Highlighted","Normal")) %>%
  # Build the boxplot. Map this column to the 'fill' and 'alpha' aesthetics
  ggplot( aes(x=class, y=hwy, fill=type, alpha=type)) +
    geom_boxplot() +
    scale_fill_manual(values=c("#69b3a2", "grey")) +
    scale_alpha_manual(values=c(1,0.1)) +
    theme_ipsum() +
    theme(legend.position = "none") +
    xlab("")
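
The example above depends on "Highlighted" sorting before "Normal", because unnamed values are matched to levels in order. A slightly more defensive sketch passes named vectors instead. The names highlight_fill and highlight_alpha are made up for illustration, and theme_ipsum() is left out so the snippet only needs ggplot2 and dplyr.

```r
library(ggplot2)
library(dplyr)

# Named vectors tie each colour and alpha value to a label explicitly,
# so the result does not depend on the alphabetical order of the labels
highlight_fill  <- c(Highlighted = "#69b3a2", Normal = "grey")
highlight_alpha <- c(Highlighted = 1, Normal = 0.1)

p <- mpg %>%
  mutate(type = ifelse(class == "subcompact", "Highlighted", "Normal")) %>%
  ggplot(aes(x = class, y = hwy, fill = type, alpha = type)) +
    geom_boxplot() +
    scale_fill_manual(values = highlight_fill) +
    scale_alpha_manual(values = highlight_alpha) +
    theme(legend.position = "none") +
    xlab("")
p
```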

 

Grouped boxplot with ggplot2

A grouped boxplot is a boxplot where categories are organized in groups and subgroups.

Here we visualize the distribution of 7 groups (called A to G) and 2 subgroups (called low and high). Note that the group must be mapped to the x aesthetic in aes(), and the subgroup to the fill aesthetic.

# library
library(ggplot2)

# create a data frame
# (treatment has 40 values and is recycled by data.frame() to fill the 280 rows)
variety=rep(LETTERS[1:7], each=40)
treatment=rep(c("high","low"), each=20)
note=seq_len(280)+sample(1:150, 280, replace=TRUE)
data=data.frame(variety, treatment, note)

# grouped boxplot
ggplot(data, aes(x=variety, y=note, fill=treatment)) +
    geom_boxplot()
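
The data frame above relies on data.frame() recycling the 40-element treatment vector across all 280 rows. Here is a sketch of a more explicit construction that also checks the design. The seed value 42 is an arbitrary assumption added for reproducibility and is not in the original.

```r
set.seed(42)  # arbitrary seed, only to make the example reproducible

variety   <- rep(LETTERS[1:7], each = 40)
# Spell out the full 280-element vector instead of relying on recycling
treatment <- rep(rep(c("high", "low"), each = 20), times = 7)
note      <- seq_len(280) + sample(1:150, 280, replace = TRUE)
data      <- data.frame(variety, treatment, note)

# Every variety/treatment combination should hold 20 observations
table(data$variety, data$treatment)
```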

 

Using small multiples

Note that an alternative to the grouped boxplot is faceting: each subgroup (left) or each group (right) is represented in a distinct panel.

# One panel per treatment
p1 <- ggplot(data, aes(x=variety, y=note, fill=treatment)) +
    geom_boxplot() +
    facet_wrap(~treatment)
p1

# One panel per variety
p2 <- ggplot(data, aes(x=variety, y=note, fill=treatment)) +
    geom_boxplot() +
    facet_wrap(~variety, scales="free")
p2

  

Boxplot with variable width

Boxplots are often criticized for hiding the underlying distribution of each category. Since individual data points are hidden, it is also impossible to know what sample size is available for each category.

In this example, box widths reflect sample size thanks to the varwidth option. On top of that, the exact sample size is added to the X axis labels for more accuracy.

# library
library(ggplot2)

# create data
names <- c(rep("A", 20), rep("B", 5), rep("C", 30), rep("D", 100))
value <- c(sample(2:5, 20, replace=TRUE), sample(4:10, 5, replace=TRUE), sample(1:7, 30, replace=TRUE), sample(3:8, 100, replace=TRUE))
data <- data.frame(names, value)

# prepare a special xlab with the number of obs for each group
# (table() works whether 'names' is stored as character or factor;
#  levels() would return NULL for a character column)
counts <- table(data$names)
my_xlab <- paste(names(counts), "\n(N=", counts, ")", sep="")

# plot
ggplot(data, aes(x=names, y=value, fill=names)) +
    geom_boxplot(varwidth = TRUE, alpha=0.2) +
    theme(legend.position="none") +
    scale_x_discrete(labels=my_xlab)

 

 
