Retain all columns after using group_by, summarise, and mutate (dplyr) on categorical variables, and plot a barplot with confidence intervals
I'm new to R.
This is my dataset
df <- tribble( ~Area_of_interest, ~Meds, ~Response,
"Internal Med", "asprin", "yes",
"Internal Med", "vitamins","no",
"Internal Med", "folic acid","yes",
"Emergency Med", "asprin", "yes",
"Emergency Med", "vitamins","no",
"Emergency Med", "folic acid","yes")
I have about 6 different "Area_of_interest" values. As you can see, all my variables are categorical. I want to plot a barplot for all 6 "Area_of_interest" values by Meds, keeping only the rows with Response "yes", all on the same barplot. The bars should have their respective confidence intervals.
I have two questions:
After I used the summarise function, I no longer had access to the variable "Area_of_interest". All the variables are categorical. How do I compute the proportions without using the summarise function, or how do I fix my code below so that I retain all my columns?
How do I compute the confidence interval for the barplot for each "Area_of_interest"?
df %>% na.omit() %>%
  group_by(Meds, Response) %>% summarise(ct = n()) %>%
  mutate(propn = paste0(round(100 * ct / sum(ct), 1), "%")) %>%
  filter(Response == "yes") %>% ggplot(aes(x = Meds, y = propn)) +
  geom_col(position = "dodge")
Solution 1:[1]
To answer your two questions
summarise() computes summary metrics within each group: it collapses the rows of each group into a single row, so every variable that is not part of the grouping is dropped. To keep all three of your variables, include all of them in group_by() before you summarise.
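To make the point about dropped columns concrete, here is a minimal sketch on a small made-up table (the object names toy, two_groups and three_groups are only for illustration and are not from the question). It compares which columns survive summarise() under the two groupings.

```r
library(dplyr)
library(tibble)

# Toy data with the same three columns as the question (values are illustrative)
toy <- tribble(
  ~Area_of_interest, ~Meds,      ~Response,
  "Internal Med",    "asprin",   "yes",
  "Internal Med",    "asprin",   "no",
  "Emergency Med",   "asprin",   "yes",
  "Emergency Med",   "vitamins", "no"
)

# Grouping by Meds and Response only: Area_of_interest is not a grouping
# variable, so summarise() drops it
two_groups <- toy %>%
  group_by(Meds, Response) %>%
  summarise(ct = n(), .groups = "drop")

# Grouping by all three variables: Area_of_interest is kept
three_groups <- toy %>%
  group_by(Area_of_interest, Meds, Response) %>%
  summarise(ct = n(), .groups = "drop")

names(two_groups)    # "Meds" "Response" "ct"
names(three_groups)  # "Area_of_interest" "Meds" "Response" "ct"
```

If you would rather keep every original row instead of one row per group, replacing summarise(ct = n()) with mutate(ct = n()) is another option, since mutate() adds a column without collapsing rows.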
Note that to calculate the proportions, you need to drop the Response variable from the grouping before the next operation, where the percentages of yes and no are calculated (unless I misunderstood your percentage metric). I wouldn't recommend rounding the value and pasting a "%" sign onto it, since that turns the variable into text; you can format the axis as a percentage when plotting, using the scales package, as recommended in other Stack Overflow posts.
Regarding confidence intervals: if I understood your question right, they are not relevant in your plot, since each bar is a single data point rather than a summary of central tendency (mean, median, etc.) from a distribution. If you had to, you can use geom_errorbar to show confidence intervals.
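If you do want error bars anyway, one possible approach is to treat each proportion of "yes" as a binomial proportion and draw an approximate 95% interval with geom_errorbar. The sketch below is my own assumption about what such an interval could look like, not something taken from the question: it uses the simple normal-approximation (Wald) formula, and the column names total, se, lower and upper are made up for illustration. It also keeps the proportion numeric and lets scales::percent format the axis.

```r
library(dplyr)
library(tibble)
library(ggplot2)

df <- tribble(
  ~Area_of_interest, ~Meds,        ~Response,
  "Internal Med",    "asprin",     "yes",
  "Internal Med",    "asprin",     "yes",
  "Internal Med",    "asprin",     "no",
  "Internal Med",    "vitamins",   "no",
  "Internal Med",    "folic acid", "yes",
  "Emergency Med",   "asprin",     "yes",
  "Emergency Med",   "vitamins",   "no",
  "Emergency Med",   "folic acid", "yes"
)

props <- df %>%
  count(Area_of_interest, Meds, Response, name = "ct") %>%
  group_by(Area_of_interest, Meds) %>%
  # proportion of each Response within every Area/Med combination
  mutate(total = sum(ct), proportion = ct / total) %>%
  ungroup() %>%
  filter(Response == "yes") %>%
  mutate(
    # Wald standard error for a binomial proportion (an assumption, see above)
    se    = sqrt(proportion * (1 - proportion) / total),
    lower = pmax(0, proportion - 1.96 * se),
    upper = pmin(1, proportion + 1.96 * se)
  )

p <- ggplot(props, aes(x = Meds, y = proportion, fill = Area_of_interest)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
                position = position_dodge(width = 0.9), width = 0.2) +
  # keep proportion numeric and format the axis as a percentage
  scale_y_continuous(labels = scales::percent)

p
```

With counts as small as in this example the Wald interval is very crude (it collapses to zero width when the proportion is 0 or 1), so an exact or Wilson interval, for example from binom.test(), may be a better choice on real data. Combinations with no "yes" rows at all, such as vitamins here, are dropped by the filter rather than shown as 0%.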
For the plot, I would recommend splitting it into multiple facets for clearer visibility, since you have 6 categories. My code uses both colour and facets, so keep one of them and remove the other.
Here's the code -
library(tidyverse)
# load data
df <- tribble( ~Area_of_interest ,~Meds,~Response,
"Internal Med", "asprin", "yes",
"Internal Med", "asprin", "yes",
"Internal Med", "asprin", "no",
"Internal Med", "vitamins","no",
"Internal Med", "folic acid","yes",
"Emergency Med", "asprin", "yes",
"Emergency Med", "vitamins","no",
"Emergency Med", "folic acid","yes")
df %>%
  na.omit() %>%
  group_by(Area_of_interest, Meds, Response) %>%
  summarise(ct = n(), .groups = 'drop_last') %>% # removes Response from the grouping variables for the next operation
  # proportion = % of 'yes'/'no' within every Area for each Med
  mutate(proportion = 100 * ct / sum(ct)) %>% # Note: mutate operates within each group, which decides the sum(ct)
  filter(Response == "yes") %>%
  ggplot(aes(x = Meds, y = proportion, fill = Area_of_interest)) +
  geom_col(position = "dodge") +
  # optionally, separate areas of interest into sub-panels for better visual clarity and remove the colouring with 'fill'
  facet_grid(rows = 'Area_of_interest')
Created on 2022-05-16 by the reprex package (v2.0.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |