Mahalanobis distance by group with dplyr

I want to compute a Mahalanobis distance for each pair of scores, after grouping by another variable. In this case, that means a Mahalanobis distance for each Attribute (across its two scores). The output should be 3 Mahalanobis distances (one each for A, B and C).

Currently I am working with the following (my original data frame contains some NAs, hence I include one in the reprex):

library(tidyverse)
library(purrr)



df <- tibble(Attribute = unlist(map(LETTERS[1:3], rep, 5)),
             Score1 = c(runif(7), NA, runif(7)),
             Score2 = runif(15))

mah_db <- df %>% 
  dplyr::group_by(Attribute) %>% 
  dplyr::summarise(MAH = mahalanobis(Score1:Score2, 
                                     center = base::colMeans(Score1:Score2), 
                                     cov(Score1:Score2, use = "pairwise.complete.obs")))

This raises the error:

Caused by error in base::colMeans(): ! 'x' must be an array of at least two dimensions

But as far as I can tell, I am giving colMeans two columns.

So what's going wrong here? And I wonder if even fixing this gives a complete solution?



Solution 1:[1]

Your question seems to be more about the statistics than about dplyr, so here is a small example based on your data, adapted from the example in ?mahalanobis.

df0 <- df  # keep the full data from the question; df is reused for each subset below

# Attribute A: squared distance of each row from the group centroid
df <- subset(x = df0, Attribute == "A", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
# p-value under a chi-squared distribution with 2 degrees of freedom (2 scores)
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
plot(density(df$mahalanobis, bw = 0.3), ylim = c(0, 0.8),
     main = "Squared Mahalanobis distances")
grid()
rug(df$mahalanobis)

# Attribute B: contains the NA, so drop incomplete rows first
df <- subset(x = df0, Attribute == "B", select = c("Score1", "Score2"))
df <- df[complete.cases(df), ]
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "red")  # add to the existing plot
rug(df$mahalanobis, col = "red")

# Attribute C
df <- subset(x = df0, Attribute == "C", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "green")
rug(df$mahalanobis, col = "green")

Hope that helps (this was too long for a comment).
Of course the code could be made much shorter, but this way each step shows what happens.
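
As for the error itself: inside summarise(), Score1 and Score2 are plain numeric vectors, so Score1:Score2 is base R's sequence operator `:` applied to them, not a column range. The result is a vector rather than a two-column matrix, which is why colMeans() complains. The following is a minimal sketch of a grouped version, not a definitive solution. It assumes that incomplete rows may simply be dropped and that one squared distance per row (from the centroid of its own Attribute) is what is wanted. The helper name group_mahalanobis and the fixed seed are my own additions for illustration.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df0 <- tibble(Attribute = rep(LETTERS[1:3], each = 5),
              Score1 = c(runif(7), NA, runif(7)),
              Score2 = runif(15))

# Hypothetical helper: bind the two score vectors of one group into a
# matrix and return each row's squared Mahalanobis distance from the
# group's own centroid, using the group's own sample covariance.
group_mahalanobis <- function(x, y) {
  m <- cbind(x, y)
  mahalanobis(m, center = colMeans(m), cov = cov(m))
}

mah_db <- df0 %>%
  filter(!is.na(Score1), !is.na(Score2)) %>%  # drop incomplete rows first
  group_by(Attribute) %>%
  mutate(MAH = group_mahalanobis(Score1, Score2)) %>%  # evaluated per group
  ungroup()

mah_db
```

Note that this gives one distance per observation, not one per Attribute. If a single number per Attribute is really needed, it first has to be decided what that number should be. The within-group sum is not informative here, since with the sample covariance it always equals (n - 1) times the number of scores.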

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    Christoph