A histogram is a graphical representation of the distribution of a numeric variable. It takes only numeric variables as input. The variable is cut into several bins, and the number of observations per bin is represented by the height of the bar.

Histograms can be built with ggplot2 thanks to the geom_histogram() function. It requires only one numeric variable as input. This function automatically cuts the variable into bins and counts the number of data points per bin. Remember to try different bin sizes using the binwidth argument.
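To make the binning step concrete, here is a minimal base R sketch of what a histogram computes: cut the variable into intervals, then count the observations in each interval. The simulated data, the seed and the bin width of 0.5 are arbitrary choices for illustration, and this is not a description of how ggplot2 implements binning internally.

```r
# Simulated data (illustrative only)
set.seed(42)
x <- rnorm(200)

# Bin edges every 0.5 units, wide enough to cover the whole range
breaks <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)

# Assign each observation to a bin, then count observations per bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)
counts <- as.vector(table(bins))

# hist() performs the same computation when given the same breaks
h <- hist(x, breaks = breaks, plot = FALSE)

print(data.frame(bin = levels(bins), count = counts))
```

The counts obtained by hand match the counts stored in the object returned by hist(), which are the bar heights of the histogram.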

Basic histogram with geom_histogram

It is relatively straightforward to build a histogram with ggplot2 thanks to the geom_histogram() function. Only one numeric variable is needed as input. Note that this code triggers a message saying that the default of 30 bins is being used: the bin width should be chosen explicitly, as explained in the section on binwidth below.

Arguments

mapping

The aesthetic mapping, usually constructed with aes() (aes_string() is deprecated in recent ggplot2 versions). Only needs to be set at the layer level if you are overriding the plot defaults.

data

A layer specific dataset - only needed if you want to override the plot defaults.

stat

The statistical transformation to use on the data for this layer.

position

The position adjustment to use for overlapping points on this layer.

...

Other arguments passed on to layer(). This can include aesthetics whose values you want to set, not map.

`na.rm`	If `FALSE`, the default, missing values are removed with a warning. If `TRUE`, missing values are silently removed.
`show.legend`	logical. Should this layer be included in the legends? `NA`, the default, includes if any aesthetics are mapped. `FALSE` never includes, and `TRUE` always includes. It can also be a named logical vector to finely select the aesthetics to display.
`inherit.aes`	If `FALSE`, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. borders().
`binwidth`	The width of the bins. Can be specified as a numeric value or as a function that calculates width from unscaled x. Here, "unscaled x" refers to the original x values in the data, before application of any scale transformation. When specifying a function along with a grouping structure, the function will be called once per group. The default is to use the number of bins in `bins`, covering the range of the data. You should always override this value, exploring multiple widths to find the best to illustrate the stories in your data. The bin width of a date variable is the number of days in each time; the bin width of a time variable is the number of seconds.
`bins`	Number of bins. Overridden by `binwidth`. Defaults to 30.
`orientation`	The orientation of the layer. The default (`NA`) automatically determines the orientation from the aesthetic mapping. In the rare event that this fails it can be given explicitly by setting `orientation` to either `"x"` or `"y"`. See the Orientation section for more detail.
`geom, stat`	Use to override the default connection between `geom_histogram()`/`geom_freqpoly()` and `stat_bin()`.
`center, boundary`	bin position specifiers. Only one, `center` or `boundary`, may be specified for a single plot. `center` specifies the center of one of the bins. `boundary` specifies the boundary between two bins. Note that if either is above or below the range of the data, things will be shifted by the appropriate integer multiple of `binwidth`. For example, to center on integers use `binwidth = 1` and `center = 0`, even if `0` is outside the range of the data. Alternatively, this same alignment can be specified with `binwidth = 1` and `boundary = 0.5`, even if `0.5` is outside the range of the data.
`breaks`	Alternatively, you can supply a numeric vector giving the bin boundaries. Overrides `binwidth`, `bins`, `center`, and `boundary`.
`closed`	One of `"right"` or `"left"` indicating whether right or left edges of bins are included in the bin.
`pad`	If `TRUE`, adds empty bins at either end of x. This ensures frequency polygons touch 0. Defaults to `FALSE`.
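As a hedged illustration of the center and boundary arguments described above, the sketch below builds two histograms that should share the same bin alignment. The data frame df and its simulated counts are assumptions made for this example; layer_data() is the ggplot2 helper that returns the data computed for a layer.

```r
library(ggplot2)

# Simulated integer data (illustrative only)
set.seed(1)
df <- data.frame(value = rpois(500, lambda = 4))

# Bins of width 1 centered on the integers
p_center <- ggplot(df, aes(x = value)) +
  geom_histogram(binwidth = 1, center = 0)

# The same alignment expressed through a boundary between two bins
p_boundary <- ggplot(df, aes(x = value)) +
  geom_histogram(binwidth = 1, boundary = 0.5)

# layer_data() returns the computed bins (xmin, xmax, count, ...)
bins_center <- layer_data(p_center)
bins_boundary <- layer_data(p_boundary)

head(bins_center[, c("xmin", "xmax", "count")])
```

Both specifications place bin edges at half-integers, so each integer value falls in the middle of its own bar.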

# library

library(ggplot2)

# dataset:

data=data.frame(value=rnorm(100))

# basic histogram

p <- ggplot(data, aes(x=value)) +

geom_histogram()

Control bin size with binwidth

A histogram takes a numeric variable as input and cuts it into several bins. Experimenting with the bin size is a very important step, since its value can have a big impact on the appearance of the histogram and thus on the message you are trying to convey. ggplot2 makes it easy to change the bin size thanks to the binwidth argument of the geom_histogram() function.

# Libraries

library(tidyverse)

library(hrbrthemes)

# Load dataset from github

data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)

# plot

p <- data %>%

filter( price<300 ) %>%

ggplot( aes(x=price)) +

geom_histogram( binwidth=3, fill="#69b3a2", color="#e9ecef", alpha=0.9) +

ggtitle("Bin size = 3") +

theme_ipsum() +

theme(

plot.title = element_text(size=15)

)
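To see how much the bin width matters, the following sketch compares a narrow width, a wide width, and a width computed by a function. It uses simulated prices rather than the dataset loaded above, and the Freedman-Diaconis rule is only one possible choice of function; the names df and fd_width are made up for this example.

```r
library(ggplot2)

# Simulated prices (illustrative only, not the dataset used above)
set.seed(123)
df <- data.frame(price = rexp(2000, rate = 1 / 80))

# A narrow and a wide bin width on the same data
p_narrow <- ggplot(df, aes(x = price)) + geom_histogram(binwidth = 3)
p_wide   <- ggplot(df, aes(x = price)) + geom_histogram(binwidth = 30)

# binwidth can also be a function of the data,
# here the Freedman-Diaconis rule: 2 * IQR / n^(1/3)
fd_width <- function(x) 2 * IQR(x) / length(x)^(1 / 3)
p_fd <- ggplot(df, aes(x = price)) + geom_histogram(binwidth = fd_width)

# Number of bins produced by each choice
n_bins <- c(
  narrow = nrow(layer_data(p_narrow)),
  wide   = nrow(layer_data(p_wide)),
  fd     = nrow(layer_data(p_fd))
)
print(n_bins)
```

Printing p_narrow, p_wide and p_fd one after the other is a quick way to judge which width shows the shape of the distribution without too much noise.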

Mirror density chart with ggplot2

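A density chart is built with the geom_density geom of ggplot2. It is possible to plot this density upside down by specifying y = -..density.. in the aesthetic mapping. In ggplot2 3.4.0 and later this notation is deprecated in favor of y = -after_stat(density), which behaves the same way. It is advised to use geom_label to indicate variable names.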

# Libraries

library(ggplot2)

library(hrbrthemes)

# Dummy data

data <- data.frame(

  var1 = rnorm(1000),

  var2 = rnorm(1000, mean=2)

)

 # Chart

p <- ggplot(data) +

  # Top

  geom_density( aes(x = var1, y = ..density..), fill="#69b3a2" ) +

  geom_label( aes(x=4.5, y=0.25, label="variable1"), color="#69b3a2") +

  # Bottom

  geom_density( aes(x = var2, y = -..density..), fill= "#404080") +

  geom_label( aes(x=4.5, y=-0.25, label="variable2"), color="#404080") +

  theme_ipsum() +

  xlab("value of x")
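The dot-dot notation still works but triggers a deprecation warning in ggplot2 3.4.0 and later. A possible equivalent with after_stat() is sketched below; the data frame df is simulated for the example, and annotate() is swapped in for geom_label() so that each label is drawn once rather than once per row. The theme is left at its default to keep the sketch free of extra packages.

```r
library(ggplot2)

# Dummy data (illustrative only)
set.seed(2)
df <- data.frame(var1 = rnorm(1000), var2 = rnorm(1000, mean = 2))

p <- ggplot(df) +
  # Top: density of var1, drawn upwards
  geom_density(aes(x = var1, y = after_stat(density)), fill = "#69b3a2") +
  # Bottom: density of var2, flipped by negating the computed density
  geom_density(aes(x = var2, y = -after_stat(density)), fill = "#404080") +
  # One label per variable
  annotate("label", x = 4.5, y = 0.25, label = "variable1", color = "#69b3a2") +
  annotate("label", x = 4.5, y = -0.25, label = "variable2", color = "#404080") +
  xlab("value of x")

# Inspect the computed values: the second layer lies below zero
top <- layer_data(p, 1)
bottom <- layer_data(p, 2)
range(bottom$y)
```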

Histogram with geom_histogram

Of course it is possible to apply exactly the same technique using geom_histogram instead of geom_density to get a mirror histogram:

# Chart

p <- ggplot(data) +

geom_histogram( aes(x = var1, y = ..density..), fill="#69b3a2" ) +

geom_label( aes(x=4.5, y=0.25, label="variable1"), color="#69b3a2") +

geom_histogram( aes(x = var2, y = -..density..), fill= "#404080") +

geom_label( aes(x=4.5, y=-0.25, label="variable2"), color="#404080") +

theme_ipsum() +

xlab("value of x")

Histogram with several groups - ggplot2

If the number of groups or variables you have is relatively low, you can display all of them on the same axis, using a bit of transparency to make sure you do not hide any data.

Note: with 2 groups, you can also build a mirror histogram.

# library

library(ggplot2)

library(dplyr)

library(hrbrthemes)

# Build dataset with different distributions

data <- data.frame(

  type = c( rep("variable 1", 1000), rep("variable 2", 1000) ),

  value = c( rnorm(1000), rnorm(1000, mean=4) )

)

# Represent it

p <- data %>%

  ggplot( aes(x=value, fill=type)) +

    geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +

    scale_fill_manual(values=c("#69b3a2", "#404080")) +

    theme_ipsum() +

    labs(fill="")

Using small multiples

If the number of groups you need to represent is high, drawing them on the same axis often results in a cluttered and unreadable figure.

A good workaround is to use small multiples, where each group is represented in a fraction of the plot window, making the figure easy to read. This is easy to build thanks to the facet_wrap() function of ggplot2.

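# Libraries (tidyverse provides gather and fct_reorder, viridis provides the color scales)

library(tidyverse)

library(hrbrthemes)

library(viridis)

# Load dataset from github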

data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",")

data <- data %>%

gather(key="text", value="value") %>%

mutate(text = gsub("\\.", " ",text)) %>%

mutate(value = round(as.numeric(value),0))

# plot

p <- data %>%

mutate(text = fct_reorder(text, value)) %>%

ggplot( aes(x=value, color=text, fill=text)) +

geom_histogram(alpha=0.6, binwidth = 5) +

scale_fill_viridis(discrete=TRUE) +

scale_color_viridis(discrete=TRUE) +

theme_ipsum() +

theme(

legend.position="none",

panel.spacing = unit(0.1, "lines"),

strip.text.x = element_text(size = 8)

) +

xlab("Assigned Probability (%)") +

ylab("Count") +

facet_wrap(~text)
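For readers who cannot download the dataset, the same small-multiples idea can be sketched on simulated data. The group names and distributions below are invented for the example, and only ggplot2 is required.

```r
library(ggplot2)

# Four simulated groups with different locations (illustrative only)
set.seed(3)
df <- data.frame(
  group = rep(c("A", "B", "C", "D"), each = 300),
  value = c(rnorm(300, 10, 3), rnorm(300, 30, 5),
            rnorm(300, 60, 8), rnorm(300, 85, 4))
)

p <- ggplot(df, aes(x = value, fill = group)) +
  geom_histogram(binwidth = 5, alpha = 0.6, color = "white") +
  # One panel per group; all panels share the same axes by default
  facet_wrap(~group) +
  theme(legend.position = "none")

# Number of panels in the built plot
n_panels <- length(unique(layer_data(p)$PANEL))
print(n_panels)
```

Because the panels share the same x axis by default, the position of each distribution can be compared directly across panels.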

Two Histograms with melt colors

Histograms are commonly used in data analysis to observe the distribution of variables. A common task in data visualization is to compare the distributions of 2 variables simultaneously.

Here is a tip to plot 2 histograms together (using the add argument of the hist function) with transparency (using the rgb function) to keep information visible where the shapes overlap.

#Create data

set.seed(1)

Ixos=rnorm(4000 , 120 , 30)

Primadur=rnorm(4000 , 200 , 30)

# First distribution

hist(Ixos, breaks=30, xlim=c(0,300), col=rgb(1,0,0,0.5), xlab="height",

ylab="nbr of plants", main="distribution of height of 2 durum wheat varieties" )

# Second with add=T to plot on top

hist(Primadur, breaks=30, xlim=c(0,300), col=rgb(0,0,1,0.5), add=T)

# Add legend

legend("topright", legend=c("Ixos","Primadur"), col=c(rgb(1,0,0,0.5),

rgb(0,0,1,0.5)), pt.cex=2, pch=15 )

# Alternatively, draw the two histograms side by side

par(

mfrow=c(1,2),

mar=c(4,4,1,0)

)

hist(Ixos, breaks=30 , xlim=c(0,300) , col=rgb(1,0,0,0.5) , xlab="height" , ylab="nbr of plants" , main="" )

hist(Primadur, breaks=30 , xlim=c(0,300) , col=rgb(0,0,1,0.5) , xlab="height" , ylab="" , main="")

Boxplot on top of histogram

This example illustrates how to split the plotting window in base R thanks to the layout function. Contrary to the par(mfrow=...) solution, layout() allows greater control of panel parts.

Here a boxplot is added on top of the histogram, making it possible to quickly read summary statistics of the distribution.

# Create data

my_variable=c(rnorm(1000 , 0 , 2) , rnorm(1000 , 9 , 2))

# Layout to split the screen

layout(mat = matrix(c(1,2),2,1, byrow=TRUE), heights = c(1,8))

# Draw the boxplot and the histogram

par(mar=c(0, 3.1, 1.1, 2.1))

boxplot(my_variable , horizontal=TRUE , ylim=c(-10,20), xaxt="n" , col=rgb(0.8,0.8,0,0.5) , frame=F)

par(mar=c(4, 3.1, 1.1, 2.1))

hist(my_variable , breaks=40 , col=rgb(0.2,0.8,0.5,0.5) , border=F , main="" , xlab="value of the variable", xlim=c(-10,20))

Histogram with colored tail

This example demonstrates how to color parts of the histogram. First of all, the hist function must be called without plotting the result, using the plot=F option. This makes it possible to store the position of each bin in an object (my_hist here).

The bin borders are then available in the $breaks slot of the object, which makes it possible to build a color vector using ifelse statements. Finally, this color vector can be used in a plot call.

# Create data

my_variable=rnorm(2000, 0 , 10)

# Calculate histogram, but do not draw it

my_hist=hist(my_variable , breaks=40 , plot=F)

# Color vector

my_color= ifelse(my_hist$breaks < -10, rgb(0.2,0.8,0.5,0.5) , ifelse (my_hist$breaks >=10, "purple", rgb(0.2,0.2,0.2,0.2) ))

# Final plot

plot(my_hist, col=my_color , border=F , main="" , xlab="value of the variable", xlim=c(-40,40) )
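One detail worth knowing: $breaks holds one more value than there are bars, so a color vector built from it is one element longer than needed. A possible variant, sketched here with a different seed and the hypothetical name bar_color, builds the colors from the $mids slot instead, which has exactly one value per bar.

```r
# Create data (illustrative only)
set.seed(4)
my_variable <- rnorm(2000, 0, 10)

# Calculate the histogram without drawing it
my_hist <- hist(my_variable, breaks = 40, plot = FALSE)

# One color per bar, chosen from the midpoint of each bin
bar_color <- ifelse(my_hist$mids < -10, rgb(0.2, 0.8, 0.5, 0.5),
             ifelse(my_hist$mids > 10, "purple", rgb(0.2, 0.2, 0.2, 0.2)))

# Final plot
plot(my_hist, col = bar_color, border = FALSE, main = "",
     xlab = "value of the variable", xlim = c(-40, 40))
```

Using the midpoints also makes the rule symmetric: a bar is colored as part of a tail when its center lies beyond -10 or 10.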


S3PROGRAMMINGTECH

How to create histogram using ggplot2 in R Programming
