Density plot based on time of the day
I've the following dataset:
https://app.box.com/s/au58xaw60r1hyeek5cua6q20byumgvmj
I want to create a density plot based on the time of the day. Here is what I've done so far:
library("ggplot2")
library("scales")
library("lubridate")
timestamp_df$timestamp_time <- format(ymd_hms(hn_tweets$timestamp), "%H:%M:%S")
ggplot(timestamp_df, aes(timestamp_time)) +
geom_density(aes(fill = ..count..)) +
scale_x_datetime(breaks = date_breaks("2 hours"),labels=date_format("%H:%M"))
It gives the following error:
Error: Invalid input: time_trans works with objects of class POSIXct only
If I convert that to POSIXct, it adds dates to the data.
Update 1
The following converted the data to NA:
timestamp_df$timestamp_time <- as.POSIXct(timestamp_df$timestamp_time, format = "%H:%M%:%S", tz = "UTC")
Solution 1:[1]
Here is one approach:
library(ggplot2)
library(lubridate)
library(scales)
df <- read.csv("data.csv") #given in OP
Convert character to POSIXct:
df$timestamp <- as.POSIXct(strptime(df$timestamp, "%m/%d/%Y %H:%M", tz = "UTC"))
library(hms)
Extract the hour, minute, and second as a time of day:
df$time <- hms::hms(second(df$timestamp), minute(df$timestamp), hour(df$timestamp))
Convert to POSIXct again, since ggplot does not work with class hms:
df$time <- as.POSIXct(df$time)
ggplot(df, aes(time)) +
geom_density(fill = "red", alpha = 0.5) + # also try tuning the bandwidth, e.g. adjust = 0.5
scale_x_datetime(breaks = date_breaks("2 hours"), labels=date_format("%H:%M"))
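A POSIXct value always carries a date, so the trick above works by putting every time of day on the same dummy date. The following is a minimal base-R sketch of that idea, not the code used in this answer: the sample timestamps are made up, and the choice of 1970-01-01 as the shared date is an assumption that mirrors what the hms conversion appears to do.

```r
# Hypothetical sample timestamps in the same format as the OP's data
stamps <- c("01/15/2017 08:30", "02/20/2017 23:45", "03/03/2017 00:10")
parsed <- as.POSIXct(strptime(stamps, "%m/%d/%Y %H:%M", tz = "UTC"))

# Seconds elapsed since midnight for each timestamp
secs <- as.numeric(format(parsed, "%H")) * 3600 +
  as.numeric(format(parsed, "%M")) * 60

# Re-anchor every time on one shared dummy date (assumed: 1970-01-01, UTC)
time_of_day <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

print(format(time_of_day, "%Y-%m-%d %H:%M"))
```

Because all values now share one date, scale_x_datetime only has to span a single day, and the date part is hidden by the "%H:%M" labels.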
To plot the density scaled to a maximum of 1:
ggplot(df) +
geom_density( aes(x = time, y = ..scaled..), fill = "red", alpha = 0.5) +
scale_x_datetime(breaks = date_breaks("2 hours"), labels=date_format("%H:%M"))
where ..scaled.. is a computed variable for stat_density, made during plot creation.
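To make the scaling concrete, here is a small base-R sketch of what a density "scaled to a maximum of 1" means: the estimate divided by its own peak. It uses stats::density on made-up data rather than ggplot2 itself, so treat it as an illustration of the idea and not as ggplot2's internal code.

```r
set.seed(42)
x <- rnorm(200)  # hypothetical sample

d <- density(x)            # ordinary kernel density estimate
scaled <- d$y / max(d$y)   # divide by the peak so the maximum becomes 1

print(range(scaled))
```

In ggplot2 3.3.0 and later, the same computed variable can also be written as after_stat(scaled), which replaces the older ..scaled.. notation.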
Solution 2:[2]
One problem with the solutions posted here is that they ignore the fact that this data is circular/polar (i.e. 00hrs == 24hrs). You can see in the plots in the other answer that the ends of the charts don't match up with each other. This won't make much of a difference with this particular dataset, but for events that happen near midnight, the density estimate could be heavily biased. Here's my solution, taking into account the circular nature of time data:
# modified code from https://freakonometrics.hypotheses.org/2239
library(dplyr)
library(ggplot2)
library(lubridate)
library(circular)
df = read.csv("data.csv")
datetimes = df$timestamp %>%
lubridate::parse_date_time("%m/%d/%Y %H:%M")
times_in_decimal = lubridate::hour(datetimes) + lubridate::minute(datetimes) / 60
times_in_radians = 2 * pi * (times_in_decimal / 24)
# Doing this just for bandwidth estimation:
basic_dens = density(times_in_radians, from = 0, to = 2 * pi)
res = circular::density.circular(circular::circular(times_in_radians,
type = "angle",
units = "radians",
rotation = "clock"),
kernel = "wrappednormal",
bw = basic_dens$bw)
time_pdf = data.frame(time = as.numeric(24 * (2 * pi + res$x) / (2 * pi)), # Convert from radians back to 24h clock
likelihood = res$y)
p = ggplot(time_pdf) +
geom_area(aes(x = time, y = likelihood), fill = "#619CFF") +
scale_x_continuous("Hour of Day", labels = 0:24, breaks = 0:24) +
scale_y_continuous("Likelihood of Data") +
theme_classic()
Note that the values and slopes of the density plot match up at the 00h and 24h points.
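The boundary problem can be reproduced without the circular package. The sketch below uses a different and cruder technique than the wrapped-normal kernel above: it copies the data one day to either side before calling stats::density, so mass near midnight can spill across the 00h/24h boundary. The simulated event times, the fixed bandwidth of 0.5 hours, and the grid size are all assumptions made for illustration.

```r
set.seed(1)
n_points <- 5000
# Hypothetical event times in decimal hours, clustered around 01:00
# and wrapped into [0, 24)
hours <- rnorm(n_points, mean = 1, sd = 1.5) %% 24

bw <- 0.5      # bandwidth in hours, fixed by assumption
grid_n <- 241  # one grid point every 0.1 h from 0 to 24

# Naive estimate: treats 00h and 24h as unrelated ends of a line
naive <- density(hours, bw = bw, from = 0, to = 24, n = grid_n)

# Wrapped estimate: replicate the data one day earlier and one day later
wrapped <- density(c(hours - 24, hours, hours + 24),
                   bw = bw, from = 0, to = 24, n = grid_n)
# Each copy holds a third of the mass, so rescale to integrate to ~1 on [0, 24]
wrapped_y <- 3 * wrapped$y

ends <- rbind(naive   = c(naive$y[1], naive$y[grid_n]),
              wrapped = c(wrapped_y[1], wrapped_y[grid_n]))
colnames(ends) <- c("at_00h", "at_24h")
print(ends)
```

In the naive row the two ends disagree, while in the wrapped row they coincide, which is the same property the circular density above provides in a more principled way.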
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |