Overlay KDE and filled histogram with ggplot2 (R)
I'm quite new to R and I'm struggling to overlay a filled histogram divided into 6 classes with a KDE based on the whole distribution (not the individual distributions of the 6 classes).
I have a dataset with 4 columns (data1, data2, data3, origin), where the three data columns are continuous and origin holds my categories (geographical locations). I can plot the histogram for data1 with the 6 classes, but when I add the KDE curve, it is also split into 6 curves (one for each class). I think I have to override the first aes argument and supply a new one when I call geom_density, but I can't find how to do so.
Translating my problem to the iris dataset, I would like a single KDE curve for Sepal.Length, not one Sepal.Length KDE curve per species. Here is my code and my result with the iris data.
ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram() +
  theme_minimal() +
  geom_density(kernel = "gaussian", bw = 0.1, alpha = 0.3)
Solution 1:[1]
The problem is that the histogram displays counts, so its bars have a total area equal to the number of observations times the bin width, whereas the density plot shows, well, density, which integrates to 1. To make the two compatible, you have to use the 'computed variables' of the stat parts of the layers, which are accessible with after_stat(). You can either scale the density so that it matches the counts, or scale the histogram so that it integrates to 1.
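The scale mismatch can be checked numerically in base R. This is only a sketch: the bin width of 0.2, the break points, and the variable names are assumptions chosen for illustration, and the KDE area is approximated with the trapezoid rule.

```r
x <- iris$Sepal.Length
bin_w <- 0.2  # assumed bin width, chosen for illustration

# Histogram on the count scale: total bar area is n * bin width
h <- hist(x, breaks = seq(4.2, 8.0, by = bin_w), plot = FALSE)
area_counts <- sum(h$counts * bin_w)

# KDE on the density scale: area under the curve is approximately 1
d <- density(x, bw = 0.1)
area_density <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Multiplying the density by n * bin width puts it on the count scale
scaled_area <- area_density * length(x) * bin_w

print(c(counts = area_counts, density = area_density, scaled = scaled_area))
```

With 150 observations and a bin width of 0.2, the histogram area is 30, while the unscaled KDE area stays close to 1, which is why the two layers do not line up without rescaling.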
Scaling the histogram to the density:
library(ggplot2)
ggplot(iris, aes(Sepal.Length, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)),
                 position = 'identity') +
  geom_density(bw = 0.1, alpha = 0.3)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Scaling the density to counts. To do this properly, multiply the count computed variable by the binwidth parameter of the histogram:
ggplot(iris, aes(Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2, position = 'identity') +
  geom_density(aes(y = after_stat(count * 0.2)),
               bw = 0.1, alpha = 0.3)
Created on 2021-06-22 by the reprex package (v1.0.0)
As a side note, the default position argument for the histogram stacks the bars on top of one another. Setting position = "identity" prevents this. Alternatively, you could set position = "stack" in the density layer.
EDIT: Sorry, I seem to have glossed over the 'I want 1 KDE for the entire Sepal.Length' part of the question. You'd have to set the group manually, like so:
ggplot(iris, aes(Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2) +
  geom_density(bw = 0.1, alpha = 0.3,
               aes(group = 1, y = after_stat(count * 0.2)))
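Since the question asks about overriding the first aes, an alternative sketch (not part of the original answer) is to map fill only inside the histogram layer, so the density layer never sees Species and is computed once for the whole column. The variable name bin_w is hypothetical and just keeps the bin width and the scaling factor in sync.

```r
library(ggplot2)

bin_w <- 0.2  # hypothetical helper: shared by the histogram and the scaling

p <- ggplot(iris, aes(x = Sepal.Length)) +
  # fill is mapped here only, so only the bars are split by species
  geom_histogram(aes(fill = Species), binwidth = bin_w) +
  # no fill mapping in this layer: a single KDE, rescaled to the count axis
  geom_density(aes(y = after_stat(count * bin_w)), bw = 0.1) +
  theme_minimal()

p
```

The design choice is between grouping explicitly (group = 1, as above) and keeping the class mapping local to the layer that needs it; both should give one curve over stacked bars.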
Solution 2:[2]
I also found a nice tutorial on combining geom_histogram() and geom_density() with matching scales on sthda.com.
The reprex from there is:
set.seed(1234)
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5),
                   rnorm(200, mean = 65, sd = 5)))
)
library(ggplot2)
# after_stat(density) is the current form of the older ..density.. notation
ggplot(df, aes(x = weight, color = sex, fill = sex)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, position = "identity") +
  geom_density(alpha = 0.2)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | André |