Retain all columns after using group_by, summarise, and mutate in dplyr on categorical variables, and plot a barplot with confidence intervals

I'm new to R.

This is my dataset

df <- tribble( ~Area_of_interest, ~Meds, ~Response,
               "Internal Med", "asprin", "yes",
               "Internal Med", "vitamins", "no",
               "Internal Med", "folic acid", "yes",
               "Emergency Med", "asprin", "yes",
               "Emergency Med", "vitamins", "no",
               "Emergency Med", "folic acid", "yes")

I have about 6 different values of "Area_of_interest". As you can see, all my variables are categorical. I want to draw a single barplot of all 6 "Area_of_interest" values by Meds, keeping only the rows whose Response is "yes". Each bar should show its confidence interval.

I have two questions:

1. After I used the summarise function, I no longer had access to the variable "Area_of_interest". All the variables are categorical. How do I compute the proportions without using summarise, or how do I fix my code below so that I retain all my columns?
2. How do I compute the confidence interval for the barplot for each "Area_of_interest"?

df %>% na.omit() %>% 
  group_by(Meds, Response) %>% summarise(ct = n()) %>%
  mutate(propn = paste0(round(100 * ct / sum(ct), 1), "%")) %>% 
  filter(Response == "yes") %>% ggplot(aes(x = Meds, y = propn)) + 
  geom_col(position = "dodge")

r dplyr bar-chart confidence-interval

Solution 1:^[1]

To answer your two questions:

summarise() computes summary metrics within each group: it collapses the rows of each group and returns only one row per group, so every variable that is not a grouping variable is dropped. To reach your goal, include all three of your variables in group_by() before you call summarise().
Note that to calculate the proportions, you need to drop the Response variable from the grouping before the next operation, so that the percentages of "yes" and "no" are calculated within each Area_of_interest and Meds combination (unless I misunderstood your percentage metric).
I wouldn't recommend rounding the value and pasting a percent sign onto it, because that turns the variable into a character string. You should be able to do that formatting while building the plot, using the scales package, as recommended in this post or this Stack Overflow answer.
Regarding confidence intervals: if I understood your question correctly, they are not relevant in your plot, since each bar shows a single data point and not a summary of central tendency (mean, median, etc.) of a distribution. If you need them anyway, you can use geom_errorbar to show them.

For the plot, I would recommend splitting it into multiple facets for clearer visibility, since you have 6 categories. The code below uses both fill colour and facets, so keep one of the two and remove the other.

Here's the code:

library(tidyverse)

# load data 
df <- tribble( ~Area_of_interest ,~Meds,~Response, 
               "Internal Med", "asprin", "yes",
               "Internal Med", "asprin", "yes",
               "Internal Med", "asprin", "no",
               
               "Internal Med", "vitamins","no",
               "Internal Med", "folic acid","yes",
               "Emergency Med", "asprin", "yes",
               "Emergency Med", "vitamins","no",
               "Emergency Med", "folic acid","yes")
               

df %>% 
  na.omit() %>% 
  group_by(Area_of_interest, Meds, Response) %>% 
  
  summarise(ct=n(), .groups = 'drop_last') %>% # removes Response from the grouping variable for the next operation
  
  # proportion = % of 'yes'/'no' within every Area for each Med 
  mutate(proportion = 100 * ct/sum(ct)) %>% # Note: mutate conducts operation within each group, which decides the sum(ct)
  
  filter(Response=="yes") %>% 
  
  ggplot(aes(x=Meds, y=proportion, fill = Area_of_interest)) + 
  geom_col(position = "dodge") +
  
  # optionally, separate areas of interest into sub-panels for better visual clarity and remove the colouring with 'fill'
  facet_grid(rows = 'Area_of_interest')
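
The advice above about formatting percentages at plot time rather than pasting "%" into the data can be sketched as follows. This is only a minimal illustration under some assumptions: the proportion is kept as a fraction between 0 and 1 (instead of being multiplied by 100), and the axis labels come from scales::label_percent(); the object names props and p are made up for this example.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Same example data as in the answer above
df <- tribble(
  ~Area_of_interest, ~Meds,        ~Response,
  "Internal Med",    "asprin",     "yes",
  "Internal Med",    "asprin",     "yes",
  "Internal Med",    "asprin",     "no",
  "Internal Med",    "vitamins",   "no",
  "Internal Med",    "folic acid", "yes",
  "Emergency Med",   "asprin",     "yes",
  "Emergency Med",   "vitamins",   "no",
  "Emergency Med",   "folic acid", "yes"
)

# 'props' is a hypothetical name; proportion stays numeric (a fraction, not a string)
props <- df %>%
  group_by(Area_of_interest, Meds, Response) %>%
  summarise(ct = n(), .groups = "drop_last") %>%  # still grouped by Area_of_interest, Meds
  mutate(proportion = ct / sum(ct)) %>%           # share of each Response within that group
  ungroup() %>%
  filter(Response == "yes")

# The percent sign is added only to the axis labels, not to the data
p <- ggplot(props, aes(x = Meds, y = proportion, fill = Area_of_interest)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::label_percent())

print(p)
```

Because proportion stays numeric, the bars keep their correct heights and ordering, which is not the case when the y variable is a string such as "66.7%".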

Created on 2022-05-16 by the reprex package (v2.0.1)
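
If error bars are wanted regardless, one possible approach (not part of the original answer's code) is to treat the share of "yes" in each Area_of_interest and Meds combination as a binomial proportion and draw an interval with geom_errorbar. The sketch below assumes an exact 95% interval from base R's binom.test() is acceptable; the column names n_total, n_yes, ci_low and ci_high are made up for the illustration. With counts as small as in the example data the intervals are very wide, so treat this as a template rather than a result.

```r
library(dplyr)
library(tibble)
library(ggplot2)

df <- tribble(
  ~Area_of_interest, ~Meds,        ~Response,
  "Internal Med",    "asprin",     "yes",
  "Internal Med",    "asprin",     "yes",
  "Internal Med",    "asprin",     "no",
  "Internal Med",    "vitamins",   "no",
  "Internal Med",    "folic acid", "yes",
  "Emergency Med",   "asprin",     "yes",
  "Emergency Med",   "vitamins",   "no",
  "Emergency Med",   "folic acid", "yes"
)

# One row per Area_of_interest x Meds, with the count of "yes" and the group size.
# Unlike filtering on Response == "yes", this keeps groups with zero "yes" answers.
props <- df %>%
  group_by(Area_of_interest, Meds) %>%
  summarise(
    n_total = n(),
    n_yes   = sum(Response == "yes"),
    .groups = "drop"
  ) %>%
  mutate(
    proportion = n_yes / n_total,
    # Assumption: exact (Clopper-Pearson) 95% interval for a binomial proportion
    ci_low  = mapply(function(x, n) binom.test(x, n)$conf.int[1], n_yes, n_total),
    ci_high = mapply(function(x, n) binom.test(x, n)$conf.int[2], n_yes, n_total)
  )

dodge <- position_dodge(width = 0.9)  # same dodge for bars and error bars so they line up

p <- ggplot(props, aes(x = Meds, y = proportion, fill = Area_of_interest)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = ci_low, ymax = ci_high), position = dodge, width = 0.2)

print(p)
```

Sharing one position_dodge() object between geom_col and geom_errorbar keeps each error bar centred on its own bar.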

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
