How to create a histogram using ggplot2 in R Programming
A histogram is a graphical representation of the distribution of a numeric variable. It takes only numeric variables as input. The variable is cut into several bins, and the number of observations per bin is represented by the height of the bar.
Histograms can be built with ggplot2 thanks to the geom_histogram() function. It requires only one numeric variable as input. This function automatically cuts the variable into bins and counts the number of data points per bin. Remember to try different bin sizes using the binwidth argument.
Basic histogram with geom_histogram
It is relatively straightforward to build a histogram with ggplot2 thanks to the geom_histogram() function. Only one numeric variable is needed in the input. Note that the basic code further down triggers a message, because no bin width is given and ggplot2 falls back to 30 bins. The bin width needs attention, as explained in the section on binwidth.
Arguments
mapping
The aesthetic mapping, usually constructed with aes or aes_string. Only needs to be set at the layer level if you are overriding the plot defaults.
data
A layer-specific dataset, only needed if you want to override the plot defaults.
stat
The statistical transformation to use on the data for this layer.
position
The position adjustment to use for overlapping points on this layer.
...
Other arguments passed on to layer. This can include aesthetics whose values you want to set, not map.
show.legend
logical. Should this layer be included in the legends?
binwidth
The width of the bins. Can be specified as a numeric value or as a function that calculates width from unscaled x. Here, "unscaled x" refers to the original x values in the data, before application of any scale transformation. When specifying a function along with a grouping structure, the function will be called once per group. The default is to use the number of bins in bins. The bin width of a date variable is the number of days in each time; the bin width of a time variable is the number of seconds.
bins
Number of bins. Overridden by binwidth.
orientation
The orientation of the layer.
geom, stat
Use to override the default connection between geom_histogram() and stat_bin().
center, boundary
bin position specifiers. Only one, center or boundary, may be specified.
breaks
Alternatively, you can supply a numeric vector giving the bin boundaries. Overrides binwidth, bins, center, and boundary.
closed
One of "right" or "left", indicating whether the right or left edges of bins are included in the bin.
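To make the binning arguments more concrete, here is a minimal sketch that bins the same simulated data in three ways: with a fixed bin width, with a function that computes the width from the data, and with explicit bin boundaries via breaks. The data, the object names and the choice of the Freedman-Diaconis rule are assumptions made for illustration only.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(value = rnorm(500))

# 1. Fixed bin width
p_width <- ggplot(df, aes(x = value)) +
  geom_histogram(binwidth = 0.5)

# 2. Bin width computed from the unscaled x values
#    (Freedman-Diaconis rule, used here only as an example)
fd_width <- function(x) 2 * IQR(x) / length(x)^(1 / 3)
p_fun <- ggplot(df, aes(x = value)) +
  geom_histogram(binwidth = fd_width)

# 3. Explicit bin boundaries, with bins closed on the left
p_breaks <- ggplot(df, aes(x = value)) +
  geom_histogram(breaks = seq(-4, 4, by = 1), closed = "left")

# layer_data() returns the bins computed for a layer
counts_width <- layer_data(p_width)$count
counts_fun   <- layer_data(p_fun)$count
```

Printing p_width, p_fun or p_breaks draws the corresponding chart, which makes it easy to compare how each choice changes the shape of the histogram.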
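# Library
library(ggplot2)
# Dataset: 100 draws from a standard normal distribution
data <- data.frame(value = rnorm(100))
# Basic histogram (no binwidth is given, so ggplot2 prints a message)
p <- ggplot(data, aes(x = value)) +
  geom_histogram()
p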
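The message triggered by the code above comes from ggplot2 falling back to its default of 30 bins. One way to avoid it is to choose the binning explicitly. The sketch below does so on simulated data and then uses layer_data() to look at the bins ggplot2 computed. The object names and the value bins = 15 are arbitrary choices for this illustration.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(value = rnorm(100))

# Setting bins (or binwidth) explicitly replaces the default binning
p_bins <- ggplot(df, aes(x = value)) +
  geom_histogram(bins = 15)

# One row per bin: lower edge, upper edge and number of observations
bin_table <- layer_data(p_bins)[, c("xmin", "xmax", "count")]
head(bin_table)

# Every observation is counted in exactly one bin
total <- sum(bin_table$count)
```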
Control bin size with binwidth
A histogram takes a numeric variable as input and cuts it into several bins. Trying several bin sizes is an important step, since the bin size can have a big impact on the appearance of the histogram and thus on the message you are trying to convey. ggplot2 makes it easy to change the bin size thanks to the binwidth argument of the geom_histogram function.
# Libraries
library(tidyverse)
library(hrbrthemes)
# Load the dataset from GitHub
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header = TRUE)
# Histogram of prices below 300, with a bin width of 3
p <- data %>%
  filter(price < 300) %>%
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 3, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
  ggtitle("Bin size = 3") +
  theme_ipsum() +
  theme(
    plot.title = element_text(size = 15)
  )
p
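To see how much the bin width matters, the following sketch builds the same histogram with three different widths and counts the bars each one produces. It uses simulated, right-skewed values rather than the dataset loaded above, and the widths 3, 15 and 50 are arbitrary.

```r
library(ggplot2)

set.seed(123)
# Simulated right-skewed "prices" (an assumption, not the real dataset)
df <- data.frame(price = rgamma(2000, shape = 2, scale = 40))

# Build one histogram per candidate bin width
widths <- c(3, 15, 50)
plots <- lapply(widths, function(w) {
  ggplot(df, aes(x = price)) +
    geom_histogram(binwidth = w, fill = "#69b3a2", color = "#e9ecef") +
    ggtitle(paste("Bin size =", w))
})

# Number of bars produced by each width
n_bars <- sapply(plots, function(p) nrow(layer_data(p)))
names(n_bars) <- widths
n_bars
```

Printing plots[[1]], plots[[2]] and plots[[3]] in turn shows the trade-off: a narrow width gives a noisy outline, a wide one hides the detail of the distribution.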
Mirror density chart with ggplot2
A density chart is built thanks to the geom_density geom of ggplot2. It is possible to plot this density upside down by specifying y = -..density.. in the aesthetic mapping. It is advised to use geom_label to indicate variable names.
# Libraries
library(ggplot2)
library(hrbrthemes)
# Dummy data
data <- data.frame(
  var1 = rnorm(1000),
  var2 = rnorm(1000, mean = 2)
)
# Chart (each layer maps its own x, so no global mapping is needed)
# Note: ggplot2 >= 3.4.0 prefers after_stat(density) over ..density..
p <- ggplot(data) +
  # Top
  geom_density(aes(x = var1, y = ..density..), fill = "#69b3a2") +
  geom_label(aes(x = 4.5, y = 0.25, label = "variable1"), color = "#69b3a2") +
  # Bottom
  geom_density(aes(x = var2, y = -..density..), fill = "#404080") +
  geom_label(aes(x = 4.5, y = -0.25, label = "variable2"), color = "#404080") +
  theme_ipsum() +
  xlab("value of x")
p
Histogram with geom_histogram
Of course it is possible to apply exactly the same technique using geom_histogram instead of geom_density to get a mirror histogram:
# Chart (reuses the dummy data created for the mirror density chart)
p <- ggplot(data) +
  geom_histogram(aes(x = var1, y = ..density..), fill = "#69b3a2") +
  geom_label(aes(x = 4.5, y = 0.25, label = "variable1"), color = "#69b3a2") +
  geom_histogram(aes(x = var2, y = -..density..), fill = "#404080") +
  geom_label(aes(x = 4.5, y = -0.25, label = "variable2"), color = "#404080") +
  theme_ipsum() +
  xlab("value of x")
p
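In ggplot2 3.4.0 and later the ..density.. notation is deprecated in favor of after_stat(density). The sketch below is one possible way to write the mirror histogram with that notation. It assumes ggplot2 3.3.0 or later, uses its own simulated data, and swaps geom_label for annotate() so that each label is drawn once rather than once per row. The bin width of 0.25 is an arbitrary choice.

```r
library(ggplot2)

set.seed(10)
df <- data.frame(
  var1 = rnorm(1000),
  var2 = rnorm(1000, mean = 2)
)

p_mirror <- ggplot(df) +
  # Top: density-scaled histogram of var1
  geom_histogram(aes(x = var1, y = after_stat(density)),
                 binwidth = 0.25, fill = "#69b3a2") +
  # Bottom: the same for var2, flipped below the axis
  geom_histogram(aes(x = var2, y = -after_stat(density)),
                 binwidth = 0.25, fill = "#404080") +
  # annotate() adds a single label per call
  annotate("label", x = 4.5, y = 0.25, label = "variable1", color = "#69b3a2") +
  annotate("label", x = 4.5, y = -0.25, label = "variable2", color = "#404080") +
  xlab("value of x")

# Bar heights computed for the top and bottom layers
top_heights    <- layer_data(p_mirror, 1)$y
bottom_heights <- layer_data(p_mirror, 2)$y
```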
Histogram with several groups - ggplot2
If the number of groups or variables you have is relatively low, you can display all of them on the same axis, using a bit of transparency to make sure you do not hide any data.
Note: with 2 groups, you can also build a mirror histogram.
# Libraries
library(ggplot2)
library(dplyr)
library(hrbrthemes)
# Build dataset with different distributions
data <- data.frame(
  type = c(rep("variable 1", 1000), rep("variable 2", 1000)),
  value = c(rnorm(1000), rnorm(1000, mean = 4))
)
# Represent it
p <- data %>%
  ggplot(aes(x = value, fill = type)) +
  geom_histogram(color = "#e9ecef", alpha = 0.6, position = 'identity') +
  scale_fill_manual(values = c("#69b3a2", "#404080")) +
  theme_ipsum() +
  labs(fill = "")
p
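The position = 'identity' setting in the code above is what makes the two groups overlap: by default geom_histogram stacks the bars of the groups on top of each other. The following sketch, on simulated data with illustrative names, compares the two behaviors by checking where the bars start.

```r
library(ggplot2)

set.seed(7)
df <- data.frame(
  type  = rep(c("variable 1", "variable 2"), each = 1000),
  value = c(rnorm(1000), rnorm(1000, mean = 1))
)

base <- ggplot(df, aes(x = value, fill = type))

# Default position: bars of the two groups are stacked
p_stack <- base + geom_histogram(binwidth = 0.5)

# position = "identity": every group starts at zero, so bars overlap
p_overlap <- base +
  geom_histogram(binwidth = 0.5, alpha = 0.6, position = "identity")

# Lower edge of each bar in both versions
ymin_stack   <- layer_data(p_stack)$ymin
ymin_overlap <- layer_data(p_overlap)$ymin
```

With overlapping bars the transparency set through alpha is what keeps both distributions visible.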
Using small multiples
If the number of groups you need to represent is high, drawing them on the same axis often results in a cluttered and unreadable figure.
A good workaround is to use small multiples, where each group is represented in a fraction of the plot window, making the figure easy to read. This is pretty easy to build thanks to the facet_wrap() function of ggplot2.
# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
# Load dataset from GitHub
data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header = TRUE, sep = ",")
# Reshape to long format, tidy the labels and round the values
data <- data %>%
  gather(key = "text", value = "value") %>%
  mutate(text = gsub("\\.", " ", text)) %>%
  mutate(value = round(as.numeric(value), 0))
# Plot: one panel per group
p <- data %>%
  mutate(text = fct_reorder(text, value)) %>%
  ggplot(aes(x = value, color = text, fill = text)) +
  geom_histogram(alpha = 0.6, binwidth = 5) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("") +
  ylab("Assigned Probability (%)") +
  facet_wrap(~text)
p
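For readers who want to try the small multiples approach without downloading a dataset or installing extra packages, here is a minimal sketch that depends on ggplot2 only. The six groups and their values are simulated purely for illustration.

```r
library(ggplot2)

set.seed(2024)
# Six simulated groups with different means (illustrative data)
groups <- paste("group", LETTERS[1:6])
df <- data.frame(
  text  = rep(groups, each = 200),
  value = unlist(lapply(seq_along(groups),
                        function(i) rnorm(200, mean = 10 * i, sd = 8)))
)

p_facets <- ggplot(df, aes(x = value, fill = text)) +
  geom_histogram(binwidth = 5, alpha = 0.6) +
  facet_wrap(~text) +                    # one panel per group
  theme(legend.position = "none") +      # panel titles replace the legend
  xlab("value") +
  ylab("count")

# Number of panels actually drawn
n_panels <- length(unique(layer_data(p_facets)$PANEL))
```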
Two Histograms with melt colors
Histograms are commonly used in data analysis to observe the distribution of variables. A common task in data visualization is to compare the distributions of 2 variables simultaneously.
Here is a tip to plot 2 histograms together in base R (using the add argument of the hist function) with transparency (using the rgb function) to keep information visible where shapes overlap.
# Create data
set.seed(1)
Ixos <- rnorm(4000, 120, 30)
Primadur <- rnorm(4000, 200, 30)
# First distribution
hist(Ixos, breaks = 30, xlim = c(0, 300), col = rgb(1, 0, 0, 0.5),
     xlab = "height", ylab = "nbr of plants",
     main = "distribution of height of 2 durum wheat varieties")
# Second with add = TRUE to plot on top
hist(Primadur, breaks = 30, xlim = c(0, 300), col = rgb(0, 0, 1, 0.5), add = TRUE)
# Add legend
legend("topright", legend = c("Ixos", "Primadur"),
       col = c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5)),
       pt.cex = 2, pch = 15)
# Alternative: draw the two histograms side by side
par(
  mfrow = c(1, 2),
  mar = c(4, 4, 1, 0)
)
hist(Ixos, breaks = 30, xlim = c(0, 300), col = rgb(1, 0, 0, 0.5),
     xlab = "height", ylab = "nbr of plants", main = "")
hist(Primadur, breaks = 30, xlim = c(0, 300), col = rgb(0, 0, 1, 0.5),
     xlab = "height", ylab = "", main = "")
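When breaks is a single number, hist() treats it as a suggestion and picks "pretty" break points separately for each call, so two overlaid histograms do not necessarily share the same bins. A minimal sketch of one way to align them is to compute a single vector of breaks from the pooled data and pass it to both calls. The name shared_breaks is introduced here for illustration.

```r
set.seed(1)
Ixos     <- rnorm(4000, 120, 30)
Primadur <- rnorm(4000, 200, 30)

# One break vector covering both samples, so the bins line up
shared_breaks <- pretty(range(c(Ixos, Primadur)), n = 30)

# plot = FALSE returns the computed histogram without drawing it
h1 <- hist(Ixos,     breaks = shared_breaks, plot = FALSE)
h2 <- hist(Primadur, breaks = shared_breaks, plot = FALSE)

# Draw both on the same axes with semi-transparent colors
plot(h1, col = rgb(1, 0, 0, 0.5), xlim = range(shared_breaks),
     ylim = c(0, max(h1$counts, h2$counts)),
     xlab = "height", ylab = "nbr of plants", main = "")
plot(h2, col = rgb(0, 0, 1, 0.5), add = TRUE)
```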
Boxplot on top of histogram
This example illustrates how to split the plotting window in base R thanks to the layout function. Contrary to the par(mfrow=...) solution, layout() allows greater control over the size and arrangement of the panels.
Here a boxplot is added on top of the histogram, which makes it possible to quickly see summary statistics of the distribution.
# Create data
my_variable <- c(rnorm(1000, 0, 2), rnorm(1000, 9, 2))
# Layout to split the screen: a thin top panel and a tall bottom panel
layout(mat = matrix(c(1, 2), 2, 1, byrow = TRUE), heights = c(1, 8))
# Draw the boxplot and the histogram
par(mar = c(0, 3.1, 1.1, 2.1))
boxplot(my_variable, horizontal = TRUE, ylim = c(-10, 20), xaxt = "n",
        col = rgb(0.8, 0.8, 0, 0.5), frame = FALSE)
par(mar = c(4, 3.1, 1.1, 2.1))
hist(my_variable, breaks = 40, col = rgb(0.2, 0.8, 0.5, 0.5), border = FALSE,
     main = "", xlab = "value of the variable", xlim = c(-10, 20))
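Before drawing anything, it can help to check how layout() has split the window. The short sketch below, with illustrative object names, builds the same two-row layout and uses layout.show() to outline the panels.

```r
# Two stacked panels: panel 1 gets 1/9 of the height, panel 2 gets 8/9
panel_matrix <- matrix(c(1, 2), nrow = 2, ncol = 1, byrow = TRUE)
n_panels <- layout(mat = panel_matrix, heights = c(1, 8))

# Outline the panels and print their numbers to check the arrangement
layout.show(n_panels)

# Go back to a single panel afterwards
layout(1)
```

The numbers in the matrix give the order in which successive plots fill the panels, so the boxplot must be drawn first to land in the thin top panel.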
Histogram with colored tail
This example demonstrates how to color parts of the histogram.
First of all, the hist function must be called without plotting the result, using the plot=F option. This stores the position of each bin in an object (my_hist here).
Those bin borders are now available in the $breaks slot of the object, which makes it possible to build a color vector using ifelse statements. Finally, this color vector can be used in a plot call.
# Create data
my_variable <- rnorm(2000, 0, 10)
# Calculate histogram, but do not draw it
my_hist <- hist(my_variable, breaks = 40, plot = FALSE)
# Color vector: one color per bin, chosen from the bin's left border
my_color <- ifelse(my_hist$breaks < -10, rgb(0.2, 0.8, 0.5, 0.5),
                   ifelse(my_hist$breaks >= 10, "purple", rgb(0.2, 0.2, 0.2, 0.2)))
# Final plot
plot(my_hist, col = my_color, border = FALSE, main = "",
     xlab = "value of the variable", xlim = c(-40, 40))
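The object returned by hist() holds one more break than there are bars, so a color vector built from $breaks has one spare element. A possible variant, sketched below with hypothetical names (tail_class, palette), classifies each bar by its midpoint from the $mids slot instead, which yields exactly one color per bar.

```r
set.seed(3)
my_variable <- rnorm(2000, 0, 10)
my_hist <- hist(my_variable, breaks = 40, plot = FALSE)

# There is one more break than there are bars
extra_breaks <- length(my_hist$breaks) - length(my_hist$counts)

# Classify each bar by its midpoint: lower tail, middle, upper tail
tail_class <- cut(my_hist$mids, breaks = c(-Inf, -10, 10, Inf),
                  labels = c("lower", "middle", "upper"))
palette <- c(lower = "seagreen3", middle = "grey80", upper = "purple")
my_color <- unname(palette[as.character(tail_class)])

plot(my_hist, col = my_color, border = FALSE, main = "",
     xlab = "value of the variable", xlim = c(-40, 40))
```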