Encoding and raw in R
I am not sure whether this is a bug. If I encode a character string to UTF-8 before converting it to raw and back again, the characters are not the same as the ones I started with. I have set the default encoding to "UTF-8" in RStudio.
rawToChar(charToRaw(enc2utf8("vægt")))
[1] "vÃ¦gt"
rawToChar(charToRaw("vægt"))
[1] "vægt"
Here is my sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggthemes_2.2.1 TTR_0.23-0 lubridate_1.3.3 tidyr_0.2.0 skm_1.0.2 ggplot2_1.0.1 dplyr_0.4.3
[8] stringr_1.0.0 dkstat_0.08
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 rstudioapi_0.3.1 magrittr_1.5 MASS_7.3-43 munsell_0.4.2 lattice_0.20-33
[7] colorspace_1.2-6 R6_2.1.1 httr_1.0.0 plyr_1.8.3 xts_0.9-7 tools_3.2.2
[13] parallel_3.2.2 grid_3.2.2 gtable_0.1.2 DBI_0.3.1 lazyeval_0.1.10 assertthat_0.1
[19] digest_0.6.8 reshape2_1.4.1 curl_0.9.3 memoise_0.2.1 labeling_0.3 stringi_0.5-5
[25] scales_0.3.0 jsonlite_0.9.17 zoo_1.7-12 proto_0.3-10
Solution 1:[1]
Here's my basic understanding of what's going on.
First some encoding facts:
Character | UTF-8 bytes (hex) | CP1252 bytes (hex) |
---|---|---|
v | 76 | 76 |
æ | c3 a6 | e6 |
g | 67 | 67 |
t | 74 | 74 |
Ã | c3 83 | c3 |
¦ | c2 a6 | a6 |
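The byte values in the table above can be checked with a few lines of base R. This is only a sketch: the helper hex_bytes() and the variable names are made up for illustration, the characters are written as Unicode escapes so that the script does not depend on how the file itself is saved, and it assumes that the iconv() implementation on your platform knows the name "CP1252".

```r
# Characters from the table, written as escapes so the source file stays ASCII.
chars <- c("v", "\u00e6", "g", "t", "\u00c3", "\u00a6")

# Hypothetical helper: convert one string to the target encoding and
# return its bytes as space-separated hex.
hex_bytes <- function(s, to) {
  converted <- iconv(s, from = "UTF-8", to = to)
  paste(as.character(charToRaw(converted)), collapse = " ")
}

byte_table <- data.frame(
  character = chars,
  utf8      = vapply(chars, hex_bytes, character(1), to = "UTF-8",  USE.NAMES = FALSE),
  cp1252    = vapply(chars, hex_bytes, character(1), to = "CP1252", USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(byte_table)
```

The utf8 and cp1252 columns should match the hex values listed in the table.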
Now the mechanics:
The Windows machine uses the CP1252 encoding, as can be seen from the sessionInfo() output. So the string vægt in the R script is represented as the bytes 76 e6 67 74, which charToRaw("vægt") confirms. Converting it to UTF-8 with enc2utf8() gives the bytes 76 c3 a6 67 74. charToRaw() returns only those bytes, so the fact that they represent UTF-8 is lost. rawToChar() later converts the bytes back to a string, which is then interpreted in the native encoding, again CP1252. Since c3 is Ã and a6 is ¦ in CP1252, we get vÃ¦gt.
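The misreading described above can be reproduced on any platform by making the CP1252 interpretation explicit with iconv(). This is a sketch under assumptions: on the Windows machine from the question, R applies the CP1252 interpretation implicitly when printing, whereas here it is forced by hand, and the variable names are illustrative only.

```r
# The word from the question, written with an escape so the file stays ASCII.
x <- "v\u00e6gt"

# Step 1: the UTF-8 bytes, with the encoding declaration stripped by charToRaw().
utf8_bytes <- charToRaw(enc2utf8(x))        # 76 c3 a6 67 74

# Step 2: back to a string; only the bytes are carried over.
as_native <- rawToChar(utf8_bytes)

# Step 3: pretend the bytes are CP1252 (as a CP1252 locale would) and
# re-encode the result as UTF-8 so it can be inspected anywhere.
misread <- iconv(as_native, from = "CP1252", to = "UTF-8")

# c3 became "Ã" (c3 83) and a6 became "¦" (c2 a6).
misread_bytes <- charToRaw(misread)         # 76 c3 83 c2 a6 67 74
print(misread_bytes)
```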
On Mac and Linux, on the other hand, the default encoding is UTF-8 throughout and the encoding mismatches do not occur. I suspect, however, that the same phenomenon as on Windows could be triggered by explicitly changing/setting the encoding used by R.
I don't think this is a bug.
Solution 2:[2]
On Windows in a non-UTF-8 locale, you can use stri_encode from the stringi package to convert raw bytes back to the correct characters and encoding,
stringi::stri_encode(charToRaw(enc2utf8("vægt")), from = "UTF-8", to = "UTF-8")
[1] "vægt"
From the documentation of charToRaw (emphasis added),
charToRaw converts a length-one character string to raw bytes. It does so without taking into account any declared encoding
Presumably rawToChar ignores original encodings in the same fashion. The stringi package, on the other hand, advertises:
stringi ... is THE R package for very fast, portable, correct, consistent, and convenient string/text processing in any locale or character encoding.
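If installing stringi is not an option, a base R alternative that may work is to re-declare the encoding on the string that rawToChar() returns, using Encoding<-. This is a sketch rather than part of the original answer, and it assumes you know for certain that the raw bytes are UTF-8; the variable names are illustrative.

```r
# UTF-8 bytes of the word from the question (escape keeps the file ASCII).
x <- "v\u00e6gt"
bytes <- charToRaw(enc2utf8(x))

# rawToChar() carries over the bytes but not the encoding declaration ...
s <- rawToChar(bytes)

# ... so declare it again by hand. This changes only the mark, not the bytes.
Encoding(s) <- "UTF-8"

print(Encoding(s))
print(nchar(s))
```

Unlike stri_encode, this does no conversion or validation: if the bytes are not actually UTF-8, the string will simply be mislabelled.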
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Niels |