Density plot based on time of the day

I have the following dataset:

https://app.box.com/s/au58xaw60r1hyeek5cua6q20byumgvmj

I want to create a density plot based on the time of the day. Here is what I've done so far:

library("ggplot2")
library("scales")
library("lubridate")

timestamp_df$timestamp_time <- format(ymd_hms(hn_tweets$timestamp), "%H:%M:%S")

ggplot(timestamp_df, aes(timestamp_time)) + 
       geom_density(aes(fill = ..count..)) +
       scale_x_datetime(breaks = date_breaks("2 hours"),labels=date_format("%H:%M"))

It gives the following error: Error: Invalid input: time_trans works with objects of class POSIXct only

If I convert that to POSIXct, it adds dates to the data.

Update 1

The following converted the data to NA:

timestamp_df$timestamp_time <- as.POSIXct(timestamp_df$timestamp_time, format = "%H:%M%:%S", tz = "UTC")

Update 2

The following is what I want to achieve: (image not included)



Solution 1:[1]

Here is one approach:

library(ggplot2)
library(lubridate)
library(scales)

df <- read.csv("data.csv") #given in OP

Convert the character timestamps to POSIXct:

df$timestamp <- as.POSIXct(strptime(df$timestamp, "%m/%d/%Y %H:%M",  tz = "UTC"))

library(hms)

Extract the hour, minute and second as an hms object:

df$time <- hms::hms(second(df$timestamp), minute(df$timestamp), hour(df$timestamp))  

Convert to POSIXct again, since scale_x_datetime does not work with class hms. Every value ends up on the same dummy date, so only the time of day varies:

df$time <- as.POSIXct(df$time)
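The same dummy-date idea can be sketched in base R alone, without hms. This is only an illustration and not the method used above: the sample timestamps are made up, and the names stamps, secs and time_of_day are hypothetical. It assumes the "%m/%d/%Y %H:%M" layout used when parsing the data.

```r
# Hypothetical sample timestamps (not taken from the real dataset)
stamps <- c("01/15/2017 08:30", "02/20/2017 23:45", "03/05/2017 00:10")

# Parse the character strings into POSIXct, as in the answer
parsed <- as.POSIXct(strptime(stamps, "%m/%d/%Y %H:%M", tz = "UTC"))

# Seconds elapsed since midnight for each timestamp
lt <- as.POSIXlt(parsed)
secs <- lt$hour * 3600 + lt$min * 60 + lt$sec

# Put every time on the same dummy date so that only the time of day differs
time_of_day <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

format(time_of_day, "%Y-%m-%d %H:%M")
```

Because all values share one date, scale_x_datetime can label the axis with "%H:%M" and the date part never shows.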


ggplot(df, aes(time)) + 
  geom_density(fill = "red", alpha = 0.5) + #also play with adjust such as adjust = 0.5
  scale_x_datetime(breaks = date_breaks("2 hours"), labels=date_format("%H:%M"))


To plot it scaled to a maximum of 1:

ggplot(df) + 
  geom_density( aes(x = time, y = ..scaled..), fill = "red", alpha = 0.5) +
  scale_x_datetime(breaks = date_breaks("2 hours"), labels=date_format("%H:%M"))

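where ..scaled.. is a variable computed by stat_density during plot creation: the density estimate rescaled to a maximum of 1. Recent ggplot2 versions write this as after_stat(scaled).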
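To make the scaled variable concrete, here is a minimal base R sketch of the same rescaling done by hand. The sample hours are invented for illustration, and stats::density stands in for what stat_density computes; the exact grid ggplot2 uses may differ.

```r
# Hypothetical times of day in decimal hours (illustration only)
hours <- c(8.5, 9.0, 9.2, 12.0, 18.5, 23.0)

# Ordinary kernel density estimate
d <- density(hours)

# Rescale so the peak of the curve equals 1; the shape is unchanged
scaled <- d$y / max(d$y)

range(scaled)
```

The rescaled curve no longer integrates to 1, so it is useful for comparing shapes rather than for reading off probabilities.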


Solution 2:[2]

One problem with the solutions posted here is that they ignore the fact that this data is circular/polar (i.e. 00hrs == 24hrs). You can see in the plots in the other answer that the two ends of the chart don't match up with each other. This won't make much of a difference with this particular dataset, but for events that happen near midnight, it could make for an extremely biased estimator of density. Here's my solution, taking into account the circular nature of time data:

# modified code from https://freakonometrics.hypotheses.org/2239

library(dplyr)
library(ggplot2)
library(lubridate)
library(circular)

df = read.csv("data.csv")
datetimes = df$timestamp %>%
  lubridate::parse_date_time("%m/%d/%Y %H:%M")
times_in_decimal = lubridate::hour(datetimes) + lubridate::minute(datetimes) / 60
times_in_radians = 2 * pi * (times_in_decimal / 24)

# Doing this just for bandwidth estimation:
basic_dens = density(times_in_radians, from = 0, to = 2 * pi)

res = circular::density.circular(circular::circular(times_in_radians,
                                                    type = "angle",
                                                    units = "radians",
                                                    rotation = "clock"),
                                 kernel = "wrappednormal",
                                 bw = basic_dens$bw)

time_pdf = data.frame(time = as.numeric(24 * (2 * pi + res$x) / (2 * pi)), # Convert from radians back to 24h clock
                      likelihood = res$y)

p = ggplot(time_pdf) +
  geom_area(aes(x = time, y = likelihood), fill = "#619CFF") +
  scale_x_continuous("Hour of Day", labels = 0:24, breaks = 0:24) +
  scale_y_continuous("Likelihood of Data") +
  theme_classic()

p  # display the plot

Density Plot considering circular data

Note that the values and slopes of the density plot match up at the 00h and 24h points.
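The effect at the boundary can be checked numerically with a small base R sketch. It does not use circular::density.circular; instead it approximates wrapping by copying the data one day to either side, which is a swapped-in technique for illustration. The event times, the bandwidth of 1 hour and all variable names are assumptions.

```r
# Hypothetical event times in decimal hours, clustered around midnight
hours <- c(22.5, 23.0, 23.3, 23.6, 23.8, 23.9, 0.2, 0.5, 12.0, 13.5)

# Plain (linear) KDE evaluated on [0, 24]: it treats 23.9h and 0.2h as far apart
linear <- density(hours, bw = 1, from = 0, to = 24, n = 241)

# Approximate wrapping: replicate the data one period to the left and right,
# estimate on [0, 24], then multiply by 3 because each copy carries 1/3 of the mass
wrapped <- density(c(hours - 24, hours, hours + 24),
                   bw = 1, from = 0, to = 24, n = 241)
wrapped_y <- wrapped$y * 3

# Compare the density at 00h and 24h for both estimates
c(linear_00h = linear$y[1], linear_24h = linear$y[241],
  wrapped_00h = wrapped_y[1], wrapped_24h = wrapped_y[241])
```

With data like this, the linear estimate gives clearly different values at the two ends of the day, while the wrapped one agrees at 00h and 24h up to numerical error.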

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2