What is the meaning of these error messages when running pivot_wider() in RStudio?

I'm a newbie in R. Is there anyone who can help me?

I import a CSV extract of Stack Overflow data from:

s <- read_csv("https://www.ics.uci.edu/~duboisc/stackoverflow/answers.csv")

Then I separate the values in the 'tags' column into rows:

ss1 <- separate_rows(s, tags)

Then I apply pivot_wider() to the 'tags' column:

ss2 <- pivot_wider(ss1, names_from = tags, values_from = qs)

The following error messages are shown,

Error: Internal error in compact_rep(): Negative n in compact_rep().
Run rlang::last_error() to see where the error occurred.
In addition: Warning messages:
1: Values are not uniquely identified; output will contain list-cols.
  • Use values_fn = list to suppress this warning.
  • Use values_fn = length to identify where the duplicates arise
  • Use values_fn = {summary_fun} to summarise duplicates
2: In nrow * ncol : NAs produced by integer overflow

I have searched the different keywords in these messages but am not able to find out the overall meaning of these errors. Is there anyone who can help me? Thanks.



Solution 1:[1]

@Anoushiravan R:

Thank you very much for your kind suggestion again.

With your suggestion, I find these error messages,

> ss1 <- s %>%
+     separate_rows(tags) %>% 
+     select(qs, tags) %>%
+     group_by(tags) %>%
+     mutate(id = row_number()) %>%
+     ungroup() %>%
+     mutate(tags = if_else(tags == "", "unknown", tags))
> ss2 <- ss1 %>% pivot_wider(names_from = tags, values_from = qs, names_repair = "minimal")

Error: cannot allocate vector of size 5.4 Gb

Before that, I always got a different message: In nrow * ncol : NAs produced by integer overflow.

I then googled In nrow * ncol : NAs produced by integer overflow and found that it may be related to the console pane. See https://github.com/wrathematics/float/issues/17

I also removed all the objects/datasets from the "global environment" and restarted RStudio, and I now get the same result as yours.

As I want to include ALL columns in the result, I removed "select(qs, tags) %>%" from your suggestion. This is the code I ran and the error I got:

> ss1 <- s %>%
+     separate_rows(tags) %>% 
+     
+     group_by(tags) %>%
+     mutate(id = row_number()) %>%
+     ungroup() %>%
+     mutate(tags = if_else(tags == "", "unknown", tags))
> View(ss1)
> ss2 <- ss1 %>% pivot_wider(names_from = tags, values_from = qs, names_repair = "minimal")

Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In nrow * ncol : NAs produced by integer overflow

The In nrow * ncol : NAs produced by integer overflow warning appears again.

I googled the first, major error, Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`, and could not find a good answer.

I also tried different combinations with "group_by" but could not get a satisfactory result. Anyway, thank you very much for your help.

Solution 2:[2]

OK, I edited my solution; I hope this is what you were looking for. This time I used separate_rows(), as per your suggestion, to separate the values stacked in each row of the tags column. Run the following code and let me know if there is anything else you need.

s %>%
  separate_rows(tags) %>% 
  select(qs, tags) %>%
  group_by(tags) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  mutate(tags = if_else(tags == "", "unknown", tags)) %>%
  pivot_wider(names_from = tags, values_from = qs, names_repair = "minimal")


# A tibble: 68,384 x 10,522
      id   php error    gd image processing  lisp scheme subjective clojure cocoa touch
   <int> <dbl> <dbl> <dbl> <dbl>      <dbl> <dbl>  <dbl>      <dbl>   <dbl> <dbl> <dbl>
 1     1     0     0     0     0          0    10     10         10      10     0     0
 2     2     0     0     0     0          0    10     10         10      10     0     0
 3     3     1     0     1     1          1    10     10         10      10     0     0
 4     4     1     2     0     1          1    10     10         10      10     1     1
 5     5     1     2     0     1          1    10     10         10      10     0     0
 6     6     2     2     1     1          1    10     10         10      10     1     1
 7     7     2     2     1     0          1    10     10         10      10     1     1
 8     8     2     2     0     0          1    10     10         10      10     3     3
 9     9     0     2     0     0          1    10     10         10      10     3     3
10    10     0     2     0     0          1    10     10         10      10     3     3
# ... with 68,374 more rows, and 10,510 more variables
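
The id column created by group_by(tags) and row_number() is what keeps every value uniquely identified, which is what the "Values are not uniquely identified" warning complains about. Below is a minimal sketch on a made-up toy tibble (the name toy and its values are assumptions for illustration, not taken from the real data), first counting the duplicates with values_fn = length and then pivoting with the id column:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the tag "php" occurs three times.
toy <- tibble(
  qs   = c(1, 2, 3, 4),
  tags = c("php", "php", "lisp", "php")
)

# values_fn = length shows how many values compete for each output cell.
counts <- toy %>%
  pivot_wider(names_from = tags, values_from = qs, values_fn = length)

# Numbering the rows within each tag gives every value its own (id, tag) cell,
# so no list-columns are needed.
wide <- toy %>%
  group_by(tags) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = tags, values_from = qs)

print(counts)
print(wide)
```

The price of this approach is that the result has as many rows as the most frequent tag has occurrences, with NA in every cell a tag does not fill.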

Since the data here is a bit heavy, I suggest you first run the code up to pivot_wider() and then run the pivot_wider() line on its own. I don't know why, but this is the only way I get the desired output; otherwise I receive an error.
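
To get a feel for how heavy the result is, here is a rough back-of-the-envelope check. It assumes that all cells of the 68,384 x 10,522 output are held as 8-byte doubles in a single allocation; that is an assumption about what happens internally, not something the error message states. Under that assumption the size is consistent with the "cannot allocate vector of size 5.4 Gb" error reported in Solution 1:

```r
# Dimensions taken from the tibble header printed above.
n_rows <- 68384
n_cols <- 10522

# Use doubles for the arithmetic so the product itself cannot overflow.
cells <- n_rows * n_cols   # number of cells in the wide table
bytes <- cells * 8         # 8 bytes per double value
gib   <- bytes / 2^30      # R reports sizes in units of 2^30 bytes as "Gb"

print(cells)
print(round(gib, 1))
```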

Solution 3:[3]

This is a limitation in R (or a bug, depending on how you look at it), and there is no direct solution for it. This is the essence of the error:

a <- 1000000L
b <- 2000000L
a * b

It yields NA with a warning: In a * b : NAs produced by integer overflow
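
The reason is that R integers are 32-bit, so any integer result above .Machine$integer.max becomes NA. A small sketch of the same multiplication done both ways (the variable names int_product and dbl_product are just for illustration):

```r
a <- 1000000L
b <- 2000000L

# Largest value an R integer can hold.
print(.Machine$integer.max)

# Integer arithmetic overflows and gives NA (the warning is silenced here).
int_product <- suppressWarnings(a * b)

# Converting to double first avoids the overflow.
dbl_product <- as.numeric(a) * as.numeric(b)

print(int_product)
print(dbl_product)
```

This only shows why the warning appears. It does not fix pivot_wider(), because the nrow * ncol multiplication happens inside code you do not control.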

I have circumvented the issue with a different approach. It is not as neat or as direct as using separate_rows() and then pivot_wider(), but it works!

This is the idea:

  1. Find all the unique (hash)tags and save them in a vector.
  2. Loop through the vector and str_detect() each element in the original text.
  3. Step 2 gives you one logical vector per tag; bind_cols() them.

In practice, steps 2 and 3 are implemented in the same loop.

For step 1, you can use separate_rows(), then distinct() on the tags column, and then pull() it out of the tbl.
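
The steps above could be sketched roughly as follows. The toy tibble s_toy, its values, and the comma separator are assumptions for illustration only; the real tags column may be delimited differently. Note that fixed() matches substrings, so a short tag would also match inside a longer one, and a stricter pattern may be needed on real data.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-in for the real data set.
s_toy <- tibble(
  qs   = c(1, 2, 3),
  tags = c("php,error", "lisp,scheme", "php,lisp")
)

# Step 1: collect the unique tags in a character vector.
all_tags <- s_toy %>%
  separate_rows(tags, sep = ",") %>%
  distinct(tags) %>%
  pull(tags)

# Steps 2 and 3: build one logical vector per tag in a loop.
flags <- list()
for (tag in all_tags) {
  flags[[tag]] <- str_detect(s_toy$tags, fixed(tag))
}

# Bind the logical columns to the original rows.
result <- bind_cols(s_toy, as_tibble(flags))
print(result)
```

The result keeps one row per original record and adds one TRUE/FALSE column per tag, so no pivoting is involved.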

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 sspoldtwo
Solution 2
Solution 3 Shaahin