How does ggplot2 split groups using cut_number if you have a small number of data points?

I am wondering about the behind-the-scenes functionality of ggplot2 and cut_number. I have some fairly complex data with many subsets that each contain a relatively small number of data points. I have attached a graph as an example:

[Image: geom_boxplot with facets]

I set up my cut here to have five groups, which worked fine for a handful of the panels, but many others formed only three boxplots. Does cut_number try to enforce a minimum number of samples per category? I suspect this because boxplots are sometimes recommended to have a minimum of five data points each, and I assume some would put that number even higher. If this is true, is there a way to change how many groups it tries to make? I would prefer this over cut_interval, as the varying widths of these boxes are informative (even if they make some of the data look poor).

For example, the third column from the left has 14, 40, and 16 samples in 2015, 2016, and 2018 respectively. I would have assumed cut_number would try to form at least four boxplots for 2016.

I am not sure if my code is relevant, but I have attached a slightly pared-down version in case I have made a mistake somewhere and am accidentally getting the wrong behavior from cut_number.

My code:

gg <- ggplot(data = LakeCombine, aes(x = JulianDate, y = log10(Concentration)))

gg + geom_boxplot(aes(group = cut_number(JulianDate, 5)), outlier.size = 2) +
  facet_grid(rows = vars(Year), cols = vars(Lake)) +
  scale_x_continuous(breaks = c(121, 182, 244, 305),
                     labels = c("May", "Jul", "Sep", "Nov"),
                     limits = c(100, 325)) +
  geom_point(data = LakeCombineZero, aes(x = JulianDate, y = Concentration), shape = 1)
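As a quick sanity check (with made-up values, since LakeCombine is not shown), cut_number() has no trouble making five groups from just 14 distinct points, so raw sample size alone may not explain the merging:

```r
library(ggplot2)

# Stand-in for 14 JulianDate values; LakeCombine itself is not shown
set.seed(42)
jd <- sort(sample(100:325, 14))
levels(cut_number(jd, 5))  # five bins from 14 distinct values
```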


Solution 1:[1]

As @sarah mentioned, it's a bit hard to give a definite answer from the data you provide, but...

Yes, cut_number has some internal logic about the size of the bins. It's not as simple as total size (N = 5 or any other number); it also has to do with relative size. In my experience it triggers an error message (see below), but that may be getting lost in the faceting workflow in your case. To give you a simple example, let's use diamonds from ggplot2.

Notice we can make 27 bins of depth from our very large sample of more than 53,000 diamonds, but not 28, and it's clearly not because any individual bin would end up with a small N. You can apply the same methodology to your own data and then decide whether to force certain cuts manually or pick a number of bins that works for every combination.

Chuck

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
str(diamonds)
#> tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
#>  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#>  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
#>  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
#>  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
#>  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#>  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
#>  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
#>  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#>  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#>  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
cut_number(diamonds$depth, n = 10) %>% table
#> .
#>     [43,60]   (60,60.8] (60.8,61.2] (61.2,61.6] (61.6,61.8] (61.8,62.1] 
#>        5625        5671        4953        6878        3981        6422 
#> (62.1,62.4] (62.4,62.7] (62.7,63.3]   (63.3,79] 
#>        5771        4366        5386        4887
cut_number(diamonds$depth, n = 27) %>% table
#> .
#>     [43,59]   (59,59.7] (59.7,60.1] (60.1,60.4] (60.4,60.7] (60.7,60.9] 
#>        2067        2182        1926        1829        2369        1987 
#> (60.9,61.1] (61.1,61.2] (61.2,61.4] (61.4,61.5] (61.5,61.6] (61.6,61.7] 
#>        2463        1426        3203        1719        1956        1904 
#> (61.7,61.8] (61.8,61.9]   (61.9,62]   (62,62.1] (62.1,62.2] (62.2,62.3] 
#>        2077        2163        2239        2020        2039        1940 
#> (62.3,62.4] (62.4,62.5] (62.5,62.6] (62.6,62.8] (62.8,62.9] (62.9,63.2] 
#>        1792        1563        1497        2539        1096        2352 
#> (63.2,63.5]   (63.5,64]     (64,79] 
#>        1847        1875        1870

cut_number(diamonds$depth, n = 28) %>% table
#> Error: Insufficient data values to produce 28 bins.

Created on 2020-04-17 by the reprex package (v0.3.0)
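The answer above mentions forcing certain cuts manually; a minimal sketch of that with base cut() might look like this (the breakpoints here are illustrative choices, not recommended values):

```r
library(ggplot2)  # for the diamonds data

# When cut_number() refuses a binning, hand-picked breaks still work.
# These breakpoints are chosen by eye purely for illustration.
manual_bins <- cut(diamonds$depth,
                   breaks = c(43, 60, 61, 62, 63, 79),
                   include.lowest = TRUE)
table(manual_bins)
```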

Solution 2:[2]

You can try the santoku package, which has a family of chop_* functions as alternatives to cut_*. In particular, chop_equally() does not return an error when not enough observations are available.

library(ggplot2)
library(santoku)

## Returns an error:
cut_number(mtcars$cyl, n = 5)
# Error in `cut_number()`:
#   ! Insufficient data values to produce 5 bins.
# Run `rlang::last_error()` to see where the error occurred.

## Returns a vector with maximum number of groups possible:
chop_equally(mtcars$cyl, groups = 5, labels = lbl_discrete())
# [1] 6—8 6—8 4—5 6—8 6—8 6—8 6—8 4—5 4—5 6—8 6—8 6—8 6—8 6—8 6—8 6—8 6—8 4—5 4—5 4—5 4—5 6—8 6—8 6—8
# [25] 6—8 4—5 4—5 4—5 6—8 6—8 6—8 4—5
# Levels: 4—5 6—8
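For what it's worth, the cut_number() error here appears to trace back to duplicated quantile breaks: mtcars$cyl takes only the three values 4, 6, and 8, so the six equal-frequency breakpoints needed for five bins cannot all differ. A base-R sketch of that check:

```r
# cyl takes only the values 4, 6 and 8, so the six quantile breakpoints
# needed for five equal-frequency bins collide
q <- quantile(mtcars$cyl, probs = seq(0, 1, length.out = 6))
q
anyDuplicated(q) > 0  # TRUE: duplicated breaks, so cut_number() gives up
```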

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Chuck P
Solution 2: yogevmh