How does ggplot2 split groups using cut_number if you have a small number of data points?
I am wondering about the behind-the-scenes functionality of ggplot2 and cut_number. I have some fairly complex data with many subsets, each containing relatively few data points. I attached a graph as an example:
I set up my cut here to have five groups, which worked fine for a handful of the panels, but many others formed only 3 boxplots. Does cut_number try to enforce a minimum number of samples per category? I suspect this because boxplots are sometimes recommended to have a minimum of 5 data points each, and I assume others would put that threshold even higher. If this is the case, is there a way to change how many groups it tries to make? I would prefer this over cut_interval, as the varying width of these boxplots is informative (even if it makes some of the data look poor).
For example, the third column from the left has 14, 40, and 16 samples in 2015, 2016, and 2018 respectively. I would have expected cut_number to form at least 4 boxplots for 2016.
I am not sure if my code is relevant, but I have attached a slightly pared-down version in case I made a mistake somewhere and am accidentally getting the wrong behaviour from cut_number.
My code:
gg <- ggplot(data = LakeCombine, aes(x = JulianDate, y = log10(Concentration)))
gg + geom_boxplot(aes(group = cut_number(JulianDate, 5)), outlier.size = 2) +
  facet_grid(rows = vars(Year), cols = vars(Lake)) +
  scale_x_continuous(breaks = c(121, 182, 244, 305),
                     labels = c("May", "Jul", "Sep", "Nov"),
                     limits = c(100, 325)) +
  geom_point(data = LakeCombineZero, aes(x = JulianDate, y = Concentration), shape = 1)
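For reference, here is a quick hypothetical diagnostic (assuming the LakeCombine columns used above) that counts both the observations and the distinct JulianDate values in each Lake-by-Year panel; a panel can never form more groups than it has distinct x-values:
library(dplyr)
# Count observations and distinct JulianDate values per facet;
# panels with few distinct values can support only that many groups.
LakeCombine %>%
  group_by(Lake, Year) %>%
  summarise(n = n(), n_dates = n_distinct(JulianDate), .groups = "drop") %>%
  arrange(n_dates)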
Solution 1:[1]
As @sarah mentioned, it is a bit hard to give you a certain answer from the data you provide, but...
Yes, cut_number has some internal logic about the size of the bins. It is not as simple as a minimum total size (N = 5 or any other number); it also has to do with relative size. In my experience it triggers an error message (see below), but that may be getting lost in the faceting workflow in your case. To give you a simple example, let's use diamonds from ggplot2.
Notice that we can make 27 bins of depth from our very large sample of more than 53,000 diamonds, but not 28, and clearly no bin's N is anywhere near a small number. You can apply the same methodology to your own data and then decide whether you want certain cuts forced manually or a bin count that works for every combination; a short sketch of one failure mechanism, plus a bin-count helper, follows the reprex below.
Chuck
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
str(diamonds)
#> tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
#> $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#> $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
#> $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
#> $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
#> $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#> $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
#> $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
#> $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#> $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#> $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
cut_number(diamonds$depth, n = 10) %>% table
#> .
#> [43,60] (60,60.8] (60.8,61.2] (61.2,61.6] (61.6,61.8] (61.8,62.1]
#> 5625 5671 4953 6878 3981 6422
#> (62.1,62.4] (62.4,62.7] (62.7,63.3] (63.3,79]
#> 5771 4366 5386 4887
cut_number(diamonds$depth, n = 27) %>% table
#> .
#> [43,59] (59,59.7] (59.7,60.1] (60.1,60.4] (60.4,60.7] (60.7,60.9]
#> 2067 2182 1926 1829 2369 1987
#> (60.9,61.1] (61.1,61.2] (61.2,61.4] (61.4,61.5] (61.5,61.6] (61.6,61.7]
#> 2463 1426 3203 1719 1956 1904
#> (61.7,61.8] (61.8,61.9] (61.9,62] (62,62.1] (62.1,62.2] (62.2,62.3]
#> 2077 2163 2239 2020 2039 1940
#> (62.3,62.4] (62.4,62.5] (62.5,62.6] (62.6,62.8] (62.8,62.9] (62.9,63.2]
#> 1792 1563 1497 2539 1096 2352
#> (63.2,63.5] (63.5,64] (64,79]
#> 1847 1875 1870
cut_number(diamonds$depth, n = 28) %>% table
#> Error: Insufficient data values to produce 28 bins.
Created on 2020-04-17 by the reprex package (v0.3.0)
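As promised, here is a minimal sketch (with a made-up vector) of one way this failure arises: cut_number places its breaks at quantiles of the data, so heavily tied values can collapse several breaks onto the same number, which triggers the error above. The max_bins() helper is hypothetical, just one way to search for the largest bin count that works:
library(ggplot2)
# Twenty ties at one value: the 0% through 80% quantiles all land on 5,
# so the requested breaks are duplicated and cut_number() refuses.
x <- c(rep(5, 20), 6, 7, 8)
cut_number(x, n = 5)
#> Error: Insufficient data values to produce 5 bins.
# Hypothetical helper: the largest n (up to n_max) for which
# cut_number() succeeds on x.
max_bins <- function(x, n_max = 30) {
  ok <- vapply(
    seq_len(n_max),
    function(n) !inherits(try(cut_number(x, n), silent = TRUE), "try-error"),
    logical(1)
  )
  max(which(ok))
}
max_bins(diamonds$depth)  # at least 27, given the n = 27 table above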
Solution 2:[2]
You can try the santoku package, which has a family of chop_* functions as alternatives to cut_*. Specifically, chop_equally() does not return an error when not enough observations are available.
library(ggplot2)
library(santoku)
## Returns an error:
cut_number(mtcars$cyl, n = 5)
# Error in `cut_number()`:
# ! Insufficient data values to produce 5 bins.
# Run `rlang::last_error()` to see where the error occurred.
## Returns a vector with maximum number of groups possible:
chop_equally(mtcars$cyl, groups = 5, labels = lbl_discrete())
# [1] 6—8 6—8 4—5 6—8 6—8 6—8 6—8 4—5 4—5 6—8 6—8 6—8 6—8 6—8 6—8 6—8 6—8 4—5 4—5 4—5 4—5 6—8 6—8 6—8
# [25] 6—8 4—5 4—5 4—5 6—8 6—8 6—8 4—5
# Levels: 4—5 6—8
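As a sketch of how this might slot into the original plot (assuming the question's LakeCombine data and columns), chop_equally() can replace cut_number() directly inside the group aesthetic:
library(ggplot2)
library(santoku)
# Same boxplot as the question, but chop_equally() falls back to the
# maximum number of groups possible instead of erroring.
ggplot(LakeCombine, aes(x = JulianDate, y = log10(Concentration))) +
  geom_boxplot(aes(group = chop_equally(JulianDate, groups = 5)),
               outlier.size = 2) +
  facet_grid(rows = vars(Year), cols = vars(Lake))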
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Chuck P
Solution 2 | yogevmh