Encoding and raw in R

I'm not sure whether this is a bug. If I encode a string to UTF-8 before converting it to raw and back again, the result is not the same as the original string. I have set the default encoding to "UTF-8" in RStudio.

rawToChar(charToRaw(enc2utf8("vægt")))
[1] "vÃ¦gt"

rawToChar(charToRaw("vægt"))
[1] "vægt"

Here is my sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggthemes_2.2.1  TTR_0.23-0      lubridate_1.3.3 tidyr_0.2.0     skm_1.0.2       ggplot2_1.0.1   dplyr_0.4.3    
[8] stringr_1.0.0   dkstat_0.08    

loaded via a namespace (and not attached):
[1] Rcpp_0.12.1      rstudioapi_0.3.1 magrittr_1.5     MASS_7.3-43      munsell_0.4.2    lattice_0.20-33 
[7] colorspace_1.2-6 R6_2.1.1         httr_1.0.0       plyr_1.8.3       xts_0.9-7        tools_3.2.2     
[13] parallel_3.2.2   grid_3.2.2       gtable_0.1.2     DBI_0.3.1        lazyeval_0.1.10  assertthat_0.1  
[19] digest_0.6.8     reshape2_1.4.1   curl_0.9.3       memoise_0.2.1    labeling_0.3     stringi_0.5-5   
[25] scales_0.3.0     jsonlite_0.9.17  zoo_1.7-12       proto_0.3-10    


Solution 1:[1]

Here's my basic understanding of what's going on.

First some encoding facts:

                  Encoding
character    UTF-8        CP1252
   v         76             76
   æ         c3 a6          e6
   g         67             67
   t         74             74
   Ã         c3 83          c3
   ¦         c2 a6          a6

Now the mechanics:

The Windows machine uses the CP1252 encoding, as can be seen from the sessionInfo output. So the vægt string in the R script is represented as the bytes 76 e6 67 74. This is confirmed by charToRaw("vægt"). If we then convert it to UTF-8 with enc2utf8(), we get 76 c3 a6 67 74. charToRaw() returns only these bytes, so the fact that they represent UTF-8 is lost. Later, rawToChar() converts these bytes back to a string, again assuming CP1252. Since c3 is Ã and a6 is ¦ in CP1252, we get vÃ¦gt.
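
The mechanics above can be reproduced on any platform with a small sketch. It builds the UTF-8 bytes explicitly rather than relying on the encoding of the script file, and it uses iconv() to play the role of "reading the bytes as CP1252". The variable names are only illustrative, and the sketch assumes that your platform's iconv recognises the encoding name "CP1252".

```r
# UTF-8 bytes of "vægt": 76 c3 a6 67 74 (see the table above)
utf8_bytes <- as.raw(c(0x76, 0xc3, 0xa6, 0x67, 0x74))

# rawToChar() just glues the bytes together; no encoding is declared
x <- rawToChar(utf8_bytes)
Encoding(x)  # "unknown"

# Simulate a CP1252 session: treat every byte as one CP1252 character
# (assumes iconv knows the name "CP1252" on this platform)
mojibake <- iconv(x, from = "CP1252", to = "UTF-8")

# c3 became Ã (c3 83 in UTF-8) and a6 became ¦ (c2 a6 in UTF-8)
charToRaw(mojibake)
```

On the Windows machine from the question no explicit iconv() call is needed, because the CP1252 interpretation happens implicitly when the string is printed.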

On Mac and Linux, on the other hand, the default encoding is UTF-8 throughout and the encoding mismatches do not occur. I suspect, however, that the same phenomenon as on Windows could be triggered by explicitly changing/setting the encoding used by R.

I don't think this is a bug.

Solution 2:[2]

On Windows in a non-UTF-8 locale, you can use stri_encode from the stringi package to convert raw bytes back to correctly encoded characters:

stringi::stri_encode(charToRaw(enc2utf8("vægt")), from = "UTF-8", to = "UTF-8")
[1] "vægt"

From the documentation of charToRaw (emphasis added),

charToRaw converts a length-one character string to raw bytes. It does so without taking into account any declared encoding

Presumably rawToChar ignores original encodings in the same fashion. The stringi package on the other hand advertises

stringi ... is THE R package for very fast, portable, correct, consistent, and convenient string/text processing in any locale or character encoding.
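
If you would rather stay in base R, one possible approach is to declare the encoding yourself after calling rawToChar. The following is only a sketch: raw_to_utf8 is a hypothetical helper name (it is not part of base R or stringi), and it assumes the raw vector really does contain valid UTF-8.

```r
# Hypothetical helper: turn raw UTF-8 bytes into a string that is
# explicitly marked as UTF-8, so R does not fall back to the native encoding.
raw_to_utf8 <- function(bytes) {
  s <- rawToChar(bytes)   # bytes are copied as-is, encoding is undeclared
  Encoding(s) <- "UTF-8"  # declare what the bytes mean; no bytes are changed
  s
}

# UTF-8 bytes of "vægt", written out so the example is locale-independent
bytes <- as.raw(c(0x76, 0xc3, 0xa6, 0x67, 0x74))
s <- raw_to_utf8(bytes)

Encoding(s)  # "UTF-8"
print(s)
```

Setting Encoding() only changes the declared encoding, not the underlying bytes, so it is the right tool only when the bytes are already UTF-8. If they are in some other encoding, iconv() or stri_encode() would be needed to actually convert them.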

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Niels