Overlay KDE and filled histogram with ggplot2 (R)

I'm quite new to R and I'm struggling to overlay a filled histogram divided into 6 classes with a KDE based on the whole distribution (not the individual distributions of the 6 classes). I have a dataset with 4 columns (data1, data2, data3, origin); the three data columns are continuous and origin holds my categories (geographical locations). I can plot the histogram of data1 with the 6 classes, but when I add the KDE curve it is also split into 6 curves (one per class). I think I have to override the first aes argument and supply a new one when I call geom_density, but I can't find how to do so.

Translating my problem to the iris dataset, I would like a single KDE curve for Sepal.Length rather than one KDE curve per species. Here is my code and the result with the iris data.

ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) +
    geom_histogram() +
    theme_minimal() +
    geom_density(kernel="gaussian", bw= 0.1, alpha=.3)

[Figure: example with the iris dataset]



Solution 1:[1]

The problem is that the histogram displays counts, so its total area is the number of observations times the bin width, whereas the density plot shows, well, density, which integrates to 1. To make the two compatible you have to use the 'computed variables' of the stat parts of the layers, which are accessible with after_stat(). You can either scale the density so that it covers the same area as the histogram, or scale the histogram so that it integrates to 1.
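To make the mismatch in scale concrete, here is a minimal sketch in base R (no ggplot2 involved) that measures both areas for Sepal.Length. The bin width of 0.2, the break positions, and the variable names are illustrative choices, not something ggplot2 prescribes.

```r
x <- iris$Sepal.Length
binwidth <- 0.2

# Histogram of counts: area = sum(count * bin width) = n * binwidth
breaks <- seq(4.2, 8.0, by = binwidth)  # spans the observed range 4.3 to 7.9
h <- hist(x, breaks = breaks, plot = FALSE)
hist_area <- sum(h$counts * diff(h$breaks))

# Kernel density estimate: area is (approximately) 1, via the trapezoidal rule
d <- density(x, bw = 0.1)
kde_area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

hist_area  # length(x) * binwidth
kde_area   # close to 1
```

The ratio between the two areas is n * binwidth, which is the factor the density has to be multiplied by to sit on the count scale.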

Scaling the histogram to the density:

library(ggplot2)
ggplot(iris, aes(Sepal.Length, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)),
                 position = 'identity') +
  geom_density(bw = 0.1, alpha = 0.3)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Scaling the density to counts. To do this properly, multiply the count computed variable of the density layer by the binwidth of the histogram.

ggplot(iris, aes(Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2, position = 'identity') +
  geom_density(aes(y = after_stat(count * 0.2)),
               bw = 0.1, alpha = 0.3)

Created on 2021-06-22 by the reprex package (v1.0.0)
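If it is unclear why count needs the extra binwidth factor, the computed variables can be inspected with layer_data(). The sketch below assumes the density layer is the second layer of the plot and drops the fill mapping to keep a single group; the object names p and kde are made up for the example.

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(binwidth = 0.2) +
  geom_density(aes(y = after_stat(count * 0.2)), bw = 0.1)

# Data of layer 2 (the density layer), including the computed variables
kde <- layer_data(p, 2)
n <- nrow(iris)

# 'count' is the density multiplied by the number of observations ...
all.equal(kde$count, kde$density * n)
# ... and the plotted y is 'count' multiplied by the bin width
all.equal(kde$y, kde$count * 0.2)
```

In other words, count alone puts the curve on a "per unit of x" scale; multiplying by the bin width converts it to "per bin", which is what the bars show.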

As a side note: the default position argument for the histogram is to stack bars on top of one another. Setting position = "identity" prevents this. Alternatively, you could set position = "stack" in the density layer so that the curves are stacked like the bars.
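A sketch of that alternative, keeping the histogram's default stacking and stacking the per-species densities to match. The bin width and bandwidth are carried over from the example above; the object name p_stack is made up.

```r
library(ggplot2)

p_stack <- ggplot(iris, aes(Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2) +                 # default position = "stack"
  geom_density(aes(y = after_stat(count * 0.2)),
               bw = 0.1, alpha = 0.3,
               position = "stack")                 # stack the curves as well

p_stack
```

This still draws one curve per species, so it only fixes the stacking mismatch, not the "single KDE" part of the question.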

EDIT: Sorry, I seem to have glossed over the 'I want 1 KDE for the entire Sepal.Length' part of the question. You'd have to set the group manually, like so:

ggplot(iris, aes(Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2) +
  geom_density(bw = 0.1, alpha = 0.3, 
               aes(group = 1, y = after_stat(count * 0.2)))
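Another way to get a single curve, closer to the "override the first aes" idea in the question, is to map fill only in the histogram layer so the density layer never inherits it. This is a sketch of that variant rather than part of the original answer; recent ggplot2 versions may also warn that fill was dropped when group = 1 is combined with an inherited fill, which this variant avoids. The object name p_single is made up.

```r
library(ggplot2)

p_single <- ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(aes(fill = Species), binwidth = 0.2) +  # fill only here
  geom_density(aes(y = after_stat(count * 0.2)),         # no fill, one group
               bw = 0.1, alpha = 0.3)

p_single
```

Because no discrete aesthetic reaches the density layer, ggplot2 treats all rows as one group and computes one KDE over the whole of Sepal.Length.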

Solution 2:[2]

I also found a nice tutorial on combining geom_histogram() and geom_density() with matching scales on sthda.com:

http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization#combine-histogram-and-density-plots

Reprex from there is:

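library(ggplot2)

# Simulated weights for two groups of 200 observations each
set.seed(1234)
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5),
                   rnorm(200, mean = 65, sd = 5)))
)

# The tutorial writes y = ..density..; that notation is deprecated since
# ggplot2 3.4.0, and after_stat(density) is the current equivalent
ggplot(df, aes(x = weight, color = sex, fill = sex)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5,
                 position = "identity") +
  geom_density(alpha = 0.2)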


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 André