How to create a histogram using ggplot2 in R Programming

 

A histogram is a graphical representation of the distribution of a numeric variable. It takes only numeric variables as input. The variable is cut into several bins, and the number of observations per bin is represented by the height of the bar.

Histograms can be built with ggplot2 using the geom_histogram() function. It requires only one numeric variable as input. The function automatically cuts the variable into bins and counts the number of data points per bin. Remember to try different bin sizes using the binwidth argument.
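To make the binning step concrete, here is a minimal base R sketch of what a histogram computes before anything is drawn. The values and bin edges are made up for illustration; cut(), table() and hist() are standard base R functions.

```r
# Made-up numeric variable, for illustration only
values <- c(1.2, 2.5, 2.7, 3.1, 3.3, 3.8, 4.4, 5.9)

# Bin edges chosen by hand: five bins of width 1
edges <- 1:6

# Assign each value to a bin, then count the observations per bin
bins   <- cut(values, breaks = edges, include.lowest = TRUE)
counts <- table(bins)
print(counts)

# hist() performs the same counting; plot = FALSE skips the drawing
hist_counts <- hist(values, breaks = edges, plot = FALSE)$counts
print(hist_counts)   # 1 2 3 1 1
```

The counts are exactly the bar heights a histogram would show for these bins.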

Basic histogram with geom_histogram

Building a histogram with ggplot2 is relatively straightforward thanks to the geom_histogram() function. Only one numeric variable is needed as input. Note that the basic code (shown after the argument list) triggers a message saying that the default of 30 bins is being used: the bin width needs attention, as explained in the section on binwidth.

Arguments

mapping

The aesthetic mapping, usually constructed with aes() (aes_string() is deprecated in current versions of ggplot2). Only needs to be set at the layer level if you are overriding the plot defaults.

data

A layer specific dataset - only needed if you want to override the plot defaults.

stat

The statistical transformation to use on the data for this layer.

position

The position adjustment to use for overlapping points on this layer.

...

Other arguments passed on to layer(). This can include aesthetics whose values you want to set, not map.

na.rm

If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.

show.legend

logical. Should this layer be included in the legends? NA, the default, includes if any aesthetics are mapped. FALSE never includes, and TRUE always includes. It can also be a named logical vector to finely select the aesthetics to display.

inherit.aes

If FALSE, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. borders().

binwidth

The width of the bins. Can be specified as a numeric value or as a function that calculates width from unscaled x. Here, "unscaled x" refers to the original x values in the data, before application of any scale transformation. When specifying a function along with a grouping structure, the function will be called once per group. The default is to use the number of bins in bins, covering the range of the data. You should always override this value, exploring multiple widths to find the best to illustrate the stories in your data.

For a date variable, the bin width is expressed in days; for a time variable, it is expressed in seconds.

bins

Number of bins. Overridden by binwidth. Defaults to 30.

orientation

The orientation of the layer. The default (NA) automatically determines the orientation from the aesthetic mapping. In the rare event that this fails it can be given explicitly by setting orientation to either "x" or "y". See the Orientation section for more detail.

geom, stat

Use to override the default connection between geom_histogram()/geom_freqpoly() and stat_bin().

center, boundary

bin position specifiers. Only one, center or boundary, may be specified for a single plot. center specifies the center of one of the bins. boundary specifies the boundary between two bins. Note that if either is above or below the range of the data, things will be shifted by the appropriate integer multiple of binwidth. For example, to center on integers use binwidth = 1 and center = 0, even if 0 is outside the range of the data. Alternatively, this same alignment can be specified with binwidth = 1 and boundary = 0.5, even if 0.5 is outside the range of the data.

breaks

Alternatively, you can supply a numeric vector giving the bin boundaries. Overrides binwidth, bins, center, and boundary.

closed

One of "right" or "left" indicating whether right or left edges of bins are included in the bin.

pad

If TRUE, adds empty bins at either end of x. This ensures frequency polygons touch 0. Defaults to FALSE.
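Some of the binning arguments above can be checked by reading back the bins that ggplot2 computes. The following is a minimal sketch rather than part of the original example: the data frame df and the break points in edges are made up, faithful is a dataset shipped with R, and layer_data() is used only to inspect the computed bins.

```r
library(ggplot2)

# Made-up integer data, for illustration only
df <- data.frame(x = c(1, 2, 2, 3, 3, 3, 4))

# center = 0 and boundary = 0.5 should describe the same bin alignment
p_center   <- ggplot(df, aes(x = x)) + geom_histogram(binwidth = 1, center = 0)
p_boundary <- ggplot(df, aes(x = x)) + geom_histogram(binwidth = 1, boundary = 0.5)

bins_center   <- layer_data(p_center)[, c("xmin", "xmax", "count")]
bins_boundary <- layer_data(p_boundary)[, c("xmin", "xmax", "count")]
print(bins_center)
print(bins_boundary)

# breaks: explicit bin edges (unequal widths chosen arbitrarily here)
edges <- c(40, 50, 60, 70, 80, 100)
p_breaks <- ggplot(faithful, aes(x = waiting)) + geom_histogram(breaks = edges)
bins_breaks <- layer_data(p_breaks)[, c("xmin", "xmax", "count")]
print(bins_breaks)
```

With unequal bin widths, raw counts can be misleading, so plotting a density on the y axis instead may be the safer choice.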

 

# library

library(ggplot2)

 # dataset:

data=data.frame(value=rnorm(100))

# basic histogram

p <- ggplot(data, aes(x=value)) +

  geom_histogram()

p
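The basic call above leaves the bin width to the default of 30 bins. As the argument list notes, binwidth also accepts a function of the data. The sketch below assumes the Freedman-Diaconis rule as one possible heuristic; fd_width is a hypothetical helper name, the seed is arbitrary, and nothing in the original example prescribes this particular rule.

```r
library(ggplot2)

set.seed(123)  # arbitrary seed so the sketch is reproducible
data <- data.frame(value = rnorm(100))

# Hypothetical helper: Freedman-Diaconis rule for choosing a bin width
fd_width <- function(x) 2 * IQR(x) / length(x)^(1/3)

# binwidth receives the function; ggplot2 calls it on the x values
p <- ggplot(data, aes(x = value)) +
  geom_histogram(binwidth = fd_width)

# Read back the width of the bins ggplot2 actually used
bin_data   <- layer_data(p)
used_width <- unique(round(bin_data$xmax - bin_data$xmin, 6))
print(used_width)
print(fd_width(data$value))
```

Passing a function rather than a number means the width adapts when the data change, which can be convenient inside reusable plotting code.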

 

Control bin size with binwidth

A histogram takes a numeric variable as input and cuts it into several bins. Trying several bin sizes is an important step, since the bin size can have a big impact on the appearance of the histogram and thus on the message you are trying to convey. ggplot2 makes it easy to change the bin size through the binwidth argument of the geom_histogram() function.

# Libraries

library(tidyverse)

library(hrbrthemes)

 # Load dataset from github

data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)

 # plot

p <- data %>%

  filter( price<300 ) %>%

  ggplot( aes(x=price)) +

    geom_histogram( binwidth=3, fill="#69b3a2", color="#e9ecef", alpha=0.9) +

    ggtitle("Bin size = 3") +

    theme_ipsum() +

    theme(

      plot.title = element_text(size=15)

    )

p
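Since the example above depends on downloading a file, here is a self-contained sketch of the same idea using the faithful dataset that ships with R. The three candidate widths are arbitrary, and layer_data() is used only to count the bins each width produces.

```r
library(ggplot2)

# Arbitrary candidate bin widths (waiting time is measured in minutes)
widths <- c(1, 5, 15)

# Build one histogram per width and count the bins ggplot2 creates
n_bins <- sapply(widths, function(w) {
  p <- ggplot(faithful, aes(x = waiting)) +
    geom_histogram(binwidth = w)
  nrow(layer_data(p))
})

summary_table <- data.frame(binwidth = widths, bins = n_bins)
print(summary_table)
```

A small width gives many narrow bars and a noisy shape, while a large width gives a few wide bars that can hide features such as a second mode, which is why comparing several widths is worthwhile.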

 

Mirror density chart with ggplot2

A density chart is built with the geom_density geom of ggplot2. It is possible to plot this density upside down by specifying y = -..density.. in the aesthetics (in ggplot2 3.4.0 and later, after_stat(density) is the preferred spelling and ..density.. is deprecated). It is advisable to use geom_label to indicate variable names.

# Libraries
library(ggplot2)
library(hrbrthemes)
 
# Dummy data
data <- data.frame(
  var1 = rnorm(1000),
  var2 = rnorm(1000, mean=2)
)
 # Chart
# Each layer maps its own x, so no plot-level x aesthetic is needed
p <- ggplot(data) +
  # Top
  geom_density( aes(x = var1, y = ..density..), fill="#69b3a2" ) +
  geom_label( aes(x=4.5, y=0.25, label="variable1"), color="#69b3a2") +
  # Bottom
  geom_density( aes(x = var2, y = -..density..), fill= "#404080") +
  geom_label( aes(x=4.5, y=-0.25, label="variable2"), color="#404080") +
  theme_ipsum() +
  xlab("value of x")
p
 

Histogram with geom_histogram

The same technique can be applied with geom_histogram instead of geom_density to get a mirror histogram (the data frame built in the previous example is reused):

# Chart

p <- ggplot(data) +

  geom_histogram( aes(x = var1, y = ..density..), fill="#69b3a2" ) +

  geom_label( aes(x=4.5, y=0.25, label="variable1"), color="#69b3a2") +

  geom_histogram( aes(x = var2, y = -..density..), fill= "#404080") +

  geom_label( aes(x=4.5, y=-0.25, label="variable2"), color="#404080") +

  theme_ipsum() +

  xlab("value of x")

 p

 

Histogram with several groups - ggplot2

If the number of groups or variables you have is relatively low, you can display all of them on the same axis, using a bit of transparency to make sure you do not hide any data.

Note: with 2 groups, you can also build a mirror histogram.

# library
library(ggplot2)
library(dplyr)
library(hrbrthemes)
# Build dataset with different distributions
data <- data.frame(
  type = c( rep("variable 1", 1000), rep("variable 2", 1000) ),
  value = c( rnorm(1000), rnorm(1000, mean=4) )
)
# Represent it
p <- data %>%
  ggplot( aes(x=value, fill=type)) +
    geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
    scale_fill_manual(values=c("#69b3a2", "#404080")) +
    theme_ipsum() +
    labs(fill="")

p
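The position = 'identity' setting in the code above matters: without it, geom_histogram() stacks the groups on top of each other. The sketch below rebuilds a similar dataset (the seed and binwidth are arbitrary choices) and uses layer_data() to compare where the bars start under each setting.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(
  type  = rep(c("variable 1", "variable 2"), each = 1000),
  value = c(rnorm(1000), rnorm(1000, mean = 4))
)

# Overlaid bars: each group is drawn from the baseline
p_identity <- ggplot(df, aes(x = value, fill = type)) +
  geom_histogram(binwidth = 0.25, alpha = 0.6, position = "identity")

# Default behaviour: bars of the two groups are stacked
p_stack <- ggplot(df, aes(x = value, fill = type)) +
  geom_histogram(binwidth = 0.25, alpha = 0.6)

# With "identity" every bar starts at zero; with stacking some bars start higher
identity_from_zero <- all(layer_data(p_identity)$ymin == 0)
stack_has_offset   <- any(layer_data(p_stack)$ymin > 0)
print(identity_from_zero)
print(stack_has_offset)
```

Overlaid bars let the two distributions be compared directly, which is why transparency is needed to keep the hidden group visible.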

Using small multiples

If the number of groups you need to represent is high, drawing them on the same axis often results in a cluttered and unreadable figure.

A good workaround is to use small multiples, where each group is represented in a fraction of the plot window, making the figure easy to read. This is easy to build with the facet_wrap() function of ggplot2.

# Libraries

library(tidyverse)

library(hrbrthemes)

library(viridis)

# Load dataset from github

data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",")

data <- data %>%

  gather(key="text", value="value") %>%

  mutate(text = gsub("\\.", " ",text)) %>%

  mutate(value = round(as.numeric(value),0))

# plot

p <- data %>%

  mutate(text = fct_reorder(text, value)) %>%

  ggplot( aes(x=value, color=text, fill=text)) +

    geom_histogram(alpha=0.6, binwidth = 5) +

    scale_fill_viridis(discrete=TRUE) +

    scale_color_viridis(discrete=TRUE) +

    theme_ipsum() +

    theme(

      legend.position="none",

      panel.spacing = unit(0.1, "lines"),

      strip.text.x = element_text(size = 8)

    ) +

    xlab("") +

    ylab("Assigned Probability (%)") +

    facet_wrap(~text)

p
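For a version that runs without downloading anything, the same small-multiple idea can be sketched with the iris dataset that ships with R. The binwidth and transparency values are arbitrary, and this is an illustration rather than a reproduction of the chart above.

```r
library(ggplot2)

# One panel per species, each with its own histogram of sepal length
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.25, alpha = 0.6) +
  facet_wrap(~Species) +
  theme(legend.position = "none") +
  xlab("Sepal length")

# Inspect the computed bins: PANEL identifies the facet each bar belongs to
panel_data <- layer_data(p)
print(table(panel_data$PANEL))

p
```

By default all panels share the same x and y axes, which keeps the groups comparable at a glance.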

 

Two histograms with blended colors

Histograms are commonly used in data analysis to observe the distribution of variables. A common task in data visualization is to compare the distributions of 2 variables simultaneously.

Here is a tip to plot 2 histograms together in base R (using the add argument of the hist() function) with transparency (using the alpha value of the rgb() function) to keep information visible where shapes overlap.
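The fourth argument of rgb() is the alpha (opacity) value, which is what lets overlapping bars show through. A minimal sketch of what these calls return, using the same two colors as the code that follows:

```r
# Half-transparent red and blue: red, green, blue, alpha, each between 0 and 1
red_half  <- rgb(1, 0, 0, 0.5)
blue_half <- rgb(0, 0, 1, 0.5)

print(red_half)   # "#FF000080"
print(blue_half)  # "#0000FF80"

# Decompose a color back into its channels, including alpha (0 to 255)
channels <- col2rgb(red_half, alpha = TRUE)
print(channels)
```

The last two hexadecimal digits encode the alpha channel, so 80 corresponds to roughly 50% opacity.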

#Create data

set.seed(1)

Ixos=rnorm(4000 , 120 , 30)    

Primadur=rnorm(4000 , 200 , 30)

 # First distribution

hist(Ixos, breaks=30, xlim=c(0,300), col=rgb(1,0,0,0.5), xlab="height",

     ylab="nbr of plants", main="distribution of height of 2 durum wheat varieties" )

# Second with add=TRUE to plot on top

hist(Primadur, breaks=30, xlim=c(0,300), col=rgb(0,0,1,0.5), add=TRUE)

# Add legend

legend("topright", legend=c("Ixos","Primadur"), col=c(rgb(1,0,0,0.5),

     rgb(0,0,1,0.5)), pt.cex=2, pch=15 )

# Alternative: draw the two histograms side by side

par(

  mfrow=c(1,2),

  mar=c(4,4,1,0)

)

hist(Ixos, breaks=30 , xlim=c(0,300) , col=rgb(1,0,0,0.5) , xlab="height" , ylab="nbr of plants" , main="" )

hist(Primadur, breaks=30 , xlim=c(0,300) , col=rgb(0,0,1,0.5) , xlab="height" , ylab="" , main="")

 

Boxplot on top of histogram

This example illustrates how to split the plotting window in base R thanks to the layout function. Contrary to the par(mfrow=...) solution, layout() allows greater control of panel parts.

Here a boxplot is added on top of the histogram, making it possible to quickly read summary statistics of the distribution.

# Create data

my_variable=c(rnorm(1000 , 0 , 2) , rnorm(1000 , 9 , 2))

 # Layout to split the screen

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  heights = c(1,8))

 # Draw the boxplot and the histogram

par(mar=c(0, 3.1, 1.1, 2.1))

boxplot(my_variable , horizontal=TRUE , ylim=c(-10,20), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F)

par(mar=c(4, 3.1, 1.1, 2.1))

hist(my_variable , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(-10,20))


Histogram with colored tail

This example demonstrates how to color parts of a histogram in base R. First, the hist function must be called without plotting the result, using the plot=F option. This stores the position of each bin in an object (my_hist here).

The bin borders are then available in the $breaks slot of the object, which makes it possible to build a color vector using ifelse statements. Finally, this color vector can be used in a plot call.

# Create data

my_variable=rnorm(2000, 0 , 10)

# Calculate histogram, but do not draw it

my_hist=hist(my_variable , breaks=40  , plot=F)

 # Color vector

my_color= ifelse(my_hist$breaks < -10, rgb(0.2,0.8,0.5,0.5) , ifelse (my_hist$breaks >=10, "purple", rgb(0.2,0.2,0.2,0.2) ))

 # Final plot

plot(my_hist, col=my_color , border=F , main="" , xlab="value of the variable", xlim=c(-40,40) )
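The object returned by hist() has one more break than it has bars, so a color vector built from $breaks is one element longer than needed. A possible variant, sketched below with an arbitrary seed, builds the colors from the bin midpoints in $mids instead, which gives exactly one color per bar. It should color the same bars as the code above whenever the thresholds fall on bin edges.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
my_variable <- rnorm(2000, 0, 10)

# Compute the histogram without drawing it
my_hist <- hist(my_variable, breaks = 40, plot = FALSE)

# breaks has one more element than counts and mids
print(length(my_hist$breaks))
print(length(my_hist$counts))
print(length(my_hist$mids))

# One color per bar, based on the bin midpoint
bar_color <- ifelse(my_hist$mids < -10, rgb(0.2, 0.8, 0.5, 0.5),
             ifelse(my_hist$mids >= 10, "purple", rgb(0.2, 0.2, 0.2, 0.2)))

plot(my_hist, col = bar_color, border = FALSE, main = "",
     xlab = "value of the variable", xlim = c(-40, 40))
```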

 
